Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Pith reviewed 2026-05-14 18:00 UTC · model grok-4.3
The pith
By aligning images and videos into the language feature space before projection, a single LLM processes both modalities and lets them improve each other.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other. Video-LLaVA achieves superior performances on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits, and outperforms Video-ChatGPT by 5.8 percent, 9.9 percent, 18.6 percent, and 10.1 percent on MSRVTT, MSVD, TGIF, and ActivityNet respectively.
What carries the argument
Alignment before projection, the step that places image and video features into a common language feature space prior to the LLM projection layers so that a single model can learn from mixed data.
If this is right
- A single model trained on mixed image-video data outperforms models built specifically for images on nine image benchmarks.
- The same model outperforms Video-ChatGPT by 5.8 to 18.6 percent on four standard video datasets.
- Images and videos improve each other's performance when processed inside one unified representation.
- A straightforward alignment step before projection is sufficient to create a working unified LVLM baseline.
Where Pith is reading between the lines
- The same pre-projection alignment idea could be tested with additional modalities such as audio or depth maps.
- If alignment before projection is the decisive factor, then future work could reduce emphasis on ever-more-complex projection layers.
- Scaling the mixed dataset size while keeping the unified representation fixed would test whether the mutual-benefit effect grows or saturates.
Load-bearing premise
The main difficulty for an LLM with multi-modal inputs is the absence of unified tokenization for images and videos before the projection layers are applied.
What would settle it
Train a non-unified model that still uses separate image and video encoders but receives the same mixed dataset and check whether it matches or exceeds Video-LLaVA on both image and video benchmarks.
read the original abstract
The Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other. Video-LLaVA achieves superior performances on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits. Additionally, our Video-LLaVA also outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos. We aim for this work to provide modest insights into the multi-modal inputs for the LLM. Code address: \href{https://github.com/PKU-YuanGroup/Video-LLaVA}
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Video-LLaVA, an LVLM that aligns image and video features into a shared language feature space prior to projection into the LLM. This unified representation enables joint training on mixed image-video datasets, yielding mutual performance gains. The model reports state-of-the-art results on 9 image benchmarks (across 5 QA datasets and 4 toolkits) and outperforms Video-ChatGPT by 5.8–18.6% on four video datasets (MSRVTT, MSVD, TGIF, ActivityNet).
Significance. If the empirical link between pre-projection alignment and the observed mutual enhancement holds, the work supplies a simple, reproducible baseline for unified LVLMs. The public code release strengthens the contribution by enabling direct verification of the mixed-training protocol and benchmark numbers.
major comments (3)
- [§3] §3 (Method): The alignment-before-projection step is described at a high level, but the manuscript does not specify whether the alignment loss is applied to frozen or jointly optimized encoders, nor the exact form of the alignment objective (contrastive, reconstruction, etc.). Without this, the causal contribution of the alignment step to the reported gains cannot be isolated from the mixed-dataset training itself.
- [§4.2] §4.2 (Ablation studies): No ablation table isolates the effect of pre-projection alignment versus post-projection fusion or separate image/video projectors. The central claim that alignment enables mutual enhancement therefore rests on the headline benchmark numbers alone rather than controlled comparisons.
- [Table 2] Table 2 (video results): The 5.8–18.6% gains over Video-ChatGPT are reported without standard deviations or multiple-run statistics; given that Video-ChatGPT itself uses a different projector and training schedule, it is unclear whether the margin is attributable to the unified representation or to other hyper-parameter differences.
minor comments (2)
- [Abstract] The abstract states '9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits'; the exact mapping between these counts and the tables in §4.1 should be clarified for reproducibility.
- [§3.1] Notation for the unified visual token space (e.g., the symbol used for the aligned feature before the LLM projector) is introduced inconsistently between §3.1 and Figure 2.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the recommendation for minor revision. We address each major point below and will update the manuscript accordingly.
read point-by-point responses
-
Referee: [§3] §3 (Method): The alignment-before-projection step is described at a high level, but the manuscript does not specify whether the alignment loss is applied to frozen or jointly optimized encoders, nor the exact form of the alignment objective (contrastive, reconstruction, etc.). Without this, the causal contribution of the alignment step to the reported gains cannot be isolated from the mixed-dataset training itself.
Authors: We thank the referee for highlighting this omission. The alignment is performed with a contrastive loss between the visual features and language embeddings while jointly optimizing the encoders; the encoders are not frozen. We will revise Section 3 to include the precise loss formulation, optimization schedule, and training details so that the contribution of the alignment step can be more clearly isolated. revision: yes
-
Referee: [§4.2] §4.2 (Ablation studies): No ablation table isolates the effect of pre-projection alignment versus post-projection fusion or separate image/video projectors. The central claim that alignment enables mutual enhancement therefore rests on the headline benchmark numbers alone rather than controlled comparisons.
Authors: We agree that a controlled ablation would strengthen the central claim. In the revised manuscript we will add an ablation study in Section 4.2 that directly compares the pre-projection unified alignment against (i) post-projection fusion and (ii) separate image/video projectors while keeping all other factors fixed. revision: yes
-
Referee: [Table 2] Table 2 (video results): The 5.8–18.6% gains over Video-ChatGPT are reported without standard deviations or multiple-run statistics; given that Video-ChatGPT itself uses a different projector and training schedule, it is unclear whether the margin is attributable to the unified representation or to other hyper-parameter differences.
Authors: We acknowledge that variance statistics would be preferable. Due to the high computational cost of LVLM training we report single-run results, which is standard practice in the field. We will add a clarifying note in the revised paper stating this limitation and pointing out that the observed gains are consistent across four distinct video benchmarks and are accompanied by mutual improvements on image tasks, supporting attribution to the unified representation rather than hyper-parameter differences alone. revision: partial
Axiom & Free-Parameter Ledger
free parameters (1)
- projection and training hyperparameters
axioms (1)
- domain assumption Transformer-based LLMs can integrate aligned visual tokens effectively
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
RobustSora: De-Watermarked Benchmark for Robust AI-Generated Video Detection
RobustSora benchmark demonstrates that current AI video detectors rely heavily on visible watermarks, with average accuracy drops of 6.6 percentage points when watermarks are erased and increased false alarms when wat...
-
AffectVerse: Emotional World Models for Multimodal Affective Computing
AffectVerse improves multimodal emotion recognition by at least 2.57% on nine benchmarks through an Emotion World Module that performs short-horizon latent affective prediction via cross-modal temporal imagination and...
-
CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding
CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.
-
WirelessSenseLLM: Zero-Shot Human Activity Understanding by Bridging Wireless Signals and Human Language
WirelessSenseLLM bridges unsegmented Wi-Fi CSI signals to LLMs via a CSI-to-Language Adapter for zero-shot human activity understanding and reasoning.
-
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
-
EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding
EyeCue detects driver cognitive distraction by modeling gaze-visual context interactions in egocentric videos and achieves 74.38% accuracy on the new CogDrive dataset, outperforming 11 baselines.
-
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-laye...
-
Grounding Video Reasoning in Physical Signals
A new benchmark converts video clips into shared grounded event records and tests models across physics, semantic, and control prompts under original, shuffled, ablated, and masked conditions, finding selective robust...
-
Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic
SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.
-
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
-
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
-
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
-
LFS: Learnable Frame Selector for Event-Aware and Temporally Diverse Video Captioning
LFS learns to select temporally diverse and event-aware frames for video captioning by using direct feedback from frozen video-LLMs, yielding gains up to 2% on VDC and over 4% on the new ICH-CC benchmark.
-
Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation
GAR-Font is a global-aware autoregressive framework for multimodal few-shot font generation that adds global tokenization, a language-style adapter, and post-refinement to improve style coherence over patch-based methods.
-
StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos
StreamGaze is a new benchmark and QA generation pipeline that measures how well MLLMs leverage gaze trajectories for temporal reasoning and proactive intention prediction in streaming egocentric videos.
-
Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?
Introduces the first dedicated benchmark for live multi-modal LLM task guidance with mistake detection and a streaming baseline model.
-
TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?
Introduces TennisTV benchmark for evaluating 17 MLLMs on tennis video understanding from stroke-level to rally-level tasks with automated pipelines and human verification.
-
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
-
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
-
MLVU: Benchmarking Multi-task Long Video Understanding
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
-
Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly
Flat-Pack Bench is a new evaluation suite that shows state-of-the-art LVLMs perform poorly on nuanced spatio-temporal reasoning required for furniture assembly videos.
-
Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models
SPpruner reduces visual tokens in VLMs via focus identification followed by context-aware scanning, retaining 22.2% tokens for 2.53x speedup on Qwen2.5-VL with negligible accuracy loss.
-
Dynamic Model Merging Made Slim
DiDi-Merging achieves dynamic model merging performance matching or exceeding prior methods while using only 1.24x to 1.4x the parameters of a single fine-tuned model.
-
OProver: A Unified Framework for Agentic Formal Theorem Proving
OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-p...
-
SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs
SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.
-
WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization
WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.
-
See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection
ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.
-
UniCon: Unified Framework for Efficient Contrastive Alignment via Kernels
UniCon unifies contrastive alignment across encoders and alignment types using kernels to enable exact closed-form updates instead of stochastic optimization.
-
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...
-
ViLL-E: Video LLM Embeddings for Retrieval
ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.
-
CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning
CFMS is a coarse-to-fine framework that uses MLLMs to create a multi-perspective knowledge tuple as a reasoning map for symbolic table operations, yielding competitive accuracy on WikiTQ and TabFact.
-
Spatio-Temporal Grounding of Large Language Models from Perception Streams
FESTS uses Spatial Regular Expressions compiled from queries to generate 27k training tuples that raise a 3B-parameter LLM's frame-level F1 on spatio-temporal video reasoning from 48.5% to 87.5%, matching GPT-4.1 whil...
-
Progressive Video Condensation with MLLM Agent for Long-form Video Understanding
ProVCA progressively condenses long videos via segment localization, snippet selection, and keyframe refinement to achieve SOTA zero-shot accuracies on EgoSchema, NExT-QA, and IntentQA with fewer frames.
-
ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling
ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.
-
Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy
Nano-EmoX is a compact 2.2B multimodal model that unifies six core affective tasks across perception, understanding, and interaction levels via a curriculum framework, achieving competitive benchmark performance.
-
Structure Over Scale: Learning Visual Reasoning from Pedagogical Video
Fine-tuning VLMs on 10K QA pairs from pedagogical children's videos produces consistent gains on NExT-QA, Video-MME, and MotionBench, indicating that explicit structure can substitute for data scale.
-
Streaming Video Instruction Tuning
Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.
-
Towards Effective Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval
OneClip-RAG enables MLLMs to handle long videos via one-shot clip retrieval and unified chunking-retrieval, delivering performance gains like matching GPT-5 level on MLVU with high efficiency on standard GPUs.
-
GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuningin Video-Language Models
GA2-CLIP uses generic attribute anchors and coupled hard-soft prompts to preserve generalization in prompt-tuned video-language models on base-to-new class tasks.
-
Boosting Reasoning in Large Multimodal Models via Activation Replay
Activation Replay boosts multimodal reasoning in post-trained LMMs by replaying low-entropy activations from base models to RLVR counterparts at test time via visual token manipulation.
-
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Spatial-MLLM boosts MLLM spatial intelligence from 2D inputs via dual encoders initialized from geometry models plus space-aware sampling, claiming state-of-the-art results.
-
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Spatial-MLLM adds a 3D spatial encoder initialized from a visual geometry model and space-aware frame sampling to MLLMs to improve spatial understanding and reasoning from purely 2D visual inputs.
-
ImgEdit: A Unified Image Editing Dataset and Benchmark
ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.
-
LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval
LiveVLM introduces VSB and PaR to compress and retrieve KV cache in streaming video LLMs, enabling LLaVA-OneVision to reach SOTA accuracy among training-free query-agnostic and training-based online models.
-
FaVChat: Hierarchical Prompt-Query Guided Facial Video Understanding with Data-Efficient GRPO
FaVChat proposes hierarchical prompt-query guided visual features and Data-Efficient GRPO for efficient training, plus the FaVChat-170K dataset, claiming consistent outperformance over prior VLLMs on facial video tasks.
-
Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks
Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-w...
-
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal de...
-
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
LongVILA scales visual-language models from 8 to 2048 video frames with 99.8% needle-in-a-haystack accuracy using long-context extension, supervised fine-tuning, and multi-modal sequence parallelism on up to 256 GPUs.
-
What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction
Introduces the QEVD benchmark for asynchronous situated interaction in fitness coaching and proposes a streaming baseline to address limitations of existing vision-language models.
-
TempCompass: Do Video LLMs Really Understand Videos?
TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.
-
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.
-
Gemini: A Family of Highly Capable Multimodal Models
Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.
-
MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering
MuKV adds multi-grained KV cache compression at patch-frame-segment levels plus semi-hierarchical retrieval to raise accuracy and cut memory in long video question-answering.
-
Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models
AOT reduces visual tokens in VLLMs via intra-frame and inter-frame anchors with local-global optimal transport, delivering competitive benchmark performance and efficiency gains in a training-free way.
-
Enhancing Speech Large Language Models through Reinforced Behavior Alignment
Reinforced Behavior Alignment (RBA) uses self-synthesized data from a teacher LLM and reinforcement learning to close the instruction-following gap in SpeechLMs, outperforming distillation and reaching SOTA on spoken ...
-
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.
-
Growing a Multi-head Twig via Distillation and Reinforcement Learning to Accelerate Large Vision-Language Models
TwigVLM adds a twig module to VLMs for twig-guided token pruning and self-speculative decoding, retaining 96% performance after pruning 88.9% visual tokens and delivering 154% speedup on long responses for LLaVA-1.5-7B.
-
LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding
LLaVA-Octopus introduces instruction-driven adaptive fusion of multiple visual projectors in a multimodal LLM to improve video understanding performance.
-
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.
-
LLaVA-OneVision: Easy Visual Task Transfer
LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
Reference graph
Works this paper leans on
-
[1]
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716--23736
work page 2022
-
[3]
Max Bain, Arsha Nagrani, G \"u l Varol, and Andrew Zisserman. 2021. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728--1738
work page 2021
-
[5]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901
work page 2020
-
[6]
David Chen and William B Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pages 190--200
work page 2011
-
[8]
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90\ See https://vicuna. lmsys. org (accessed 14 April 2023)
work page 2023
-
[9]
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. 2023. https://arxiv.org/abs/2305.06500 Instructblip: Towards general-purpose vision-language models with instruction tuning . Preprint, arXiv:2305.06500
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[12]
Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. 2023. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180--15190
work page 2023
-
[14]
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904--6913
work page 2017
-
[15]
Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. 2018. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608--3617
work page 2018
-
[17]
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll \'a r, and Ross Girshick. 2022. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000--16009
work page 2022
-
[19]
Drew A Hudson and Christopher D Manning. 2019. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700--6709
work page 2019
-
[21]
Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. 2017. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2758--2766
work page 2017
-
[23]
Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. Vilt: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning, pages 5583--5594. PMLR
work page 2021
-
[24]
Obelisc: An open web-scale filtered dataset of interleaved image-text documents
Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh. 2023. https://arxiv.org/abs/2306.16527 Obelics: An open web-scale filtered dataset of interleaved image-text documents . Preprint, arXiv:2306.16527
-
[27]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888--12900. PMLR
work page 2022
-
[28]
Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. 2021. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34:9694--9705
work page 2021
-
[35]
Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507--2521
work page 2022
-
[39]
OpenAI. 2023. https://arxiv.org/abs/2303.08774 Gpt-4 technical report . Preprint, arXiv:2303.08774
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[40]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730--27744
work page 2022
-
[42]
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556--2565
work page 2018
-
[44]
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317--8326
work page 2019
-
[46]
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Stanford alpaca: An instruction-following llama model
work page 2023
-
[50]
Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288--5296
work page 2016
-
[55]
Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. 2019. Activitynet-qa: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9127--9134
work page 2019
-
[60]
Improved Baselines with Visual Instruction Tuning
Improved Baselines with Visual Instruction Tuning , author=. arXiv preprint arXiv:2310.03744 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[61]
Visual instruction tuning , author=. arXiv preprint arXiv:2304.08485 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[62]
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models , author=. arXiv preprint arXiv:2306.05424 , year=
work page internal anchor Pith review arXiv
-
[63]
VideoChat: Chat-Centric Video Understanding
Videochat: Chat-centric video understanding , author=. arXiv preprint arXiv:2305.06355 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[64]
Valley: Video assistant with large language model enhanced ability.arXiv preprint arXiv:2306.07207,
Valley: Video Assistant with Large Language model Enhanced abilitY , author=. arXiv preprint arXiv:2306.07207 , year=
-
[65]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
Frozen in time: A joint video and image encoder for end-to-end retrieval , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
- [66]
-
[67]
Stanford alpaca: An instruction-following llama model , author=
-
[68]
LLaMA: Open and Efficient Foundation Language Models
Llama: Open and efficient foundation language models , author=. arXiv preprint arXiv:2302.13971 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[69]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Llama 2: Open foundation and fine-tuned chat models , author=. arXiv preprint arXiv:2307.09288 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[70]
Vicuna: An open-source chatbot impressing gpt-4 with 90\ author=. See https://vicuna. lmsys. org (accessed 14 April 2023) , year=
work page 2023
-
[71]
Advances in Neural Information Processing Systems , volume=
Training language models to follow instructions with human feedback , author=. Advances in Neural Information Processing Systems , volume=
-
[72]
Advances in neural information processing systems , volume=
Language models are few-shot learners , author=. Advances in neural information processing systems , volume=
-
[73]
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Visual chatgpt: Talking, drawing and editing with visual foundation models , author=. arXiv preprint arXiv:2303.04671 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[74]
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface , author=. arXiv preprint arXiv:2303.17580 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[75]
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
Mm-react: Prompting chatgpt for multimodal reasoning and action , author=. arXiv preprint arXiv:2303.11381 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[76]
ViperGPT: Visual Inference via Python Execution for Reasoning
Vipergpt: Visual inference via python execution for reasoning , author=. arXiv preprint arXiv:2303.08128 , year=
work page internal anchor Pith review arXiv
-
[77]
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning , author=. 2023 , eprint=
work page 2023
-
[78]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Minigpt-4: Enhancing vision-language understanding with advanced large language models , author=. arXiv preprint arXiv:2304.10592 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[79]
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
mplug-owl: Modularization empowers large language models with multimodality , author=. arXiv preprint arXiv:2304.14178 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[80]
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Video-llama: An instruction-tuned audio-visual language model for video understanding , author=. arXiv preprint arXiv:2306.02858 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[81]
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
Llama-adapter: Efficient fine-tuning of language models with zero-init attention , author=. arXiv preprint arXiv:2303.16199 , year=
-
[82]
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
Llama-adapter v2: Parameter-efficient visual instruction model , author=. arXiv preprint arXiv:2304.15010 , year=
work page internal anchor Pith review arXiv
-
[83]
Imagebind-llm: Multi-modality instruction tun- ing
Imagebind-llm: Multi-modality instruction tuning , author=. arXiv preprint arXiv:2309.03905 , year=
-
[84]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Imagebind: One embedding space to bind them all , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[85]
Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning , author=. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
-
[86]
International Conference on Machine Learning , pages=
Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation , author=. International Conference on Machine Learning , pages=. 2022 , organization=
work page 2022
-
[87]
Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
Making the v in vqa matter: Elevating the role of image understanding in visual question answering , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
-
[88]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Gqa: A new dataset for real-world visual reasoning and compositional question answering , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[89]
Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
Vizwiz grand challenge: Answering visual questions from blind people , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
-
[90]
Advances in Neural Information Processing Systems , volume=
Learn to explain: Multimodal reasoning via thought chains for science question answering , author=. Advances in Neural Information Processing Systems , volume=
-
[91]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Towards vqa models that can read , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[92]
Evaluating Object Hallucination in Large Vision-Language Models
Evaluating object hallucination in large vision-language models , author=. arXiv preprint arXiv:2305.10355 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[93]
MMBench: Is Your Multi-modal Model an All-around Player?
MMBench: Is Your Multi-modal Model an All-around Player? , author=. arXiv preprint arXiv:2307.06281 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[94]
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Mm-vet: Evaluating large multimodal models for integrated capabilities , author=. arXiv preprint arXiv:2308.02490 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[95]
Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
Msr-vtt: A large video description dataset for bridging video and language , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
-
[96]
Collecting highly parallel data for paraphrase evaluation , author=. Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies , pages=
-
[97]
Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
Tgif-qa: Toward spatio-temporal reasoning in visual question answering , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
-
[98]
Proceedings of the AAAI Conference on Artificial Intelligence , volume=
Activitynet-qa: A dataset for understanding complex web videos via question answering , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[99]
Palm: Pre-training an autoencoding&autoregressive language model for context-conditioned generation , author=. arXiv preprint arXiv:2004.07159 , year=
-
[100]
Palm 2 technical report , author=. arXiv preprint arXiv:2305.10403 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[101]
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Bloom: A 176b-parameter open-access multilingual language model , author=. arXiv preprint arXiv:2211.05100 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[102]
Advances in Neural Information Processing Systems , volume=
Flamingo: a visual language model for few-shot learning , author=. Advances in Neural Information Processing Systems , volume=
-
[103]
Advances in neural information processing systems , volume=
Align before fuse: Vision and language representation learning with momentum distillation , author=. Advances in neural information processing systems , volume=
-
[104]
International Conference on Machine Learning , pages=
Vilt: Vision-and-language transformer without convolution or region supervision , author=. International Conference on Machine Learning , pages=. 2021 , organization=
work page 2021
-
[105]
Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. arXiv preprint arXiv:2301.12597 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[106]
Otter: A Multi-Modal Model with In-Context Instruction Tuning
Otter: A multi-modal model with in-context instruction tuning , author=. arXiv preprint arXiv:2305.03726 , year=
work page internal anchor Pith review arXiv
-
[107]
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment , author=. arXiv preprint arXiv:2310.01852 , year=
work page internal anchor Pith review arXiv
-
[108]
Multimodal-gpt: A vision and language model for dialogue with humans
Multimodal-gpt: A vision and language model for dialogue with humans , author=. arXiv preprint arXiv:2305.04790 , year=
-
[109]
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding , author=. arXiv preprint arXiv:2311.08046 , year=
-
[110]
OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents , author=. 2023 , eprint=
work page 2023
-
[111]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Masked autoencoders are scalable vision learners , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[112]
Proceedings of the IEEE international conference on computer vision , pages=
Vqa: Visual question answering , author=. Proceedings of the IEEE international conference on computer vision , pages=
-
[113]
X-llm: Bootstrapping advanced large language models by treating multi-modalities as foreign languages , author=. arXiv preprint arXiv:2305.04160 , year=
-
[114]
Macaw-LLM : Multi-modal language modeling with image, audio, video, and text integration
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration , author=. arXiv preprint arXiv:2306.09093 , year=
-
[115]
Ilharco, Gabriel and Wortsman, Mitchell and Wightman, Ross and Gordon, Cade and Carlini, Nicholas and Taori, Rohan and Dave, Achal and Shankar, Vaishaal and Namkoong, Hongseok and Miller, John and Hajishirzi, Hannaneh and Farhadi, Ali and Schmidt, Ludwig , title =. doi:10.5281/zenodo.5143773 , url =
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.