Recognition: 1 theorem link · Lean Theorem
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Pith reviewed 2026-05-14 18:00 UTC · model grok-4.3
The pith
By aligning images and videos into the language feature space before projection, a single LLM processes both modalities and lets them improve each other.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other. Video-LLaVA achieves superior performances on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits, and outperforms Video-ChatGPT by 5.8 percent, 9.9 percent, 18.6 percent, and 10.1 percent on MSRVTT, MSVD, TGIF, and ActivityNet respectively.
What carries the argument
Alignment before projection, the step that places image and video features into a common language feature space prior to the LLM projection layers so that a single model can learn from mixed data.
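As a rough illustration of the idea described above, the sketch below passes image tokens and video tokens, assumed to already sit in one aligned feature space, through a single shared projection before they reach the LLM. Everything here is hypothetical: the toy dimensions, the `fake_encoder` stand-in, and the single linear layer are assumptions made for illustration, since this page does not specify the paper's encoders or projector architecture.

```python
import random

D_VISUAL, D_LLM = 8, 16  # hypothetical toy dimensions, not taken from the paper


def make_weight(d_in, d_out, seed):
    """Build a random d_out x d_in weight matrix (one row per output unit)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]


def linear(tokens, weight):
    """Apply one bias-free linear map to every token vector."""
    return [[sum(x * w for x, w in zip(tok, row)) for row in weight] for tok in tokens]


def fake_encoder(num_tokens, seed):
    """Stand-in for an encoder whose outputs are assumed to be pre-aligned."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(D_VISUAL)] for _ in range(num_tokens)]


# One projection shared by both modalities (the "unified" part of the sketch).
shared_projection = make_weight(D_VISUAL, D_LLM, seed=0)

image_tokens = fake_encoder(4, seed=1)      # e.g. 4 patch tokens from one image
video_tokens = fake_encoder(2 * 4, seed=2)  # e.g. 2 frames x 4 patch tokens

image_inputs = linear(image_tokens, shared_projection)
video_inputs = linear(video_tokens, shared_projection)

# Both modalities end up as same-width vectors that one LLM could consume.
llm_visual_input = image_inputs + video_inputs
print(len(llm_visual_input), len(llm_visual_input[0]))
```

The design point the sketch tries to capture is that only one projection is learned, so mixed image and video batches update the same parameters.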
If this is right
- A single model trained on mixed image-video data outperforms models built specifically for images on nine image benchmarks.
- The same model outperforms Video-ChatGPT by 5.8 to 18.6 percent on four standard video datasets.
- Images and videos improve each other's performance when processed inside one unified representation.
- A straightforward alignment step before projection is sufficient to create a working unified LVLM baseline.
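The "5.8 to 18.6 percent" range quoted above can be unpacked with a few lines of arithmetic. The four gains are the ones reported in the abstract; the mean is a derived figure that does not appear in the source, and the page does not say whether the gains are absolute points or relative improvements.

```python
# Reported gains of Video-LLaVA over Video-ChatGPT, as quoted in the abstract.
gains = {"MSRVTT": 5.8, "MSVD": 9.9, "TGIF": 18.6, "ActivityNet": 10.1}

smallest = min(gains, key=gains.get)
largest = max(gains, key=gains.get)
mean_gain = round(sum(gains.values()) / len(gains), 1)

print(f"smallest gain: {smallest} ({gains[smallest]})")
print(f"largest gain:  {largest} ({gains[largest]})")
print(f"mean gain:     {mean_gain}")  # derived here, not stated in the paper
```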
Where Pith is reading between the lines
- The same pre-projection alignment idea could be tested with additional modalities such as audio or depth maps.
- If alignment before projection is the decisive factor, then future work could reduce emphasis on ever-more-complex projection layers.
- Scaling the mixed dataset size while keeping the unified representation fixed would test whether the mutual-benefit effect grows or saturates.
Load-bearing premise
The main difficulty for an LLM with multi-modal inputs is the absence of unified tokenization for images and videos before the projection layers are applied.
What would settle it
Train a non-unified model that still uses separate image and video encoders but receives the same mixed dataset and check whether it matches or exceeds Video-LLaVA on both image and video benchmarks.
Original abstract
The Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other. Video-LLaVA achieves superior performances on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits. Additionally, our Video-LLaVA also outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos. We aim for this work to provide modest insights into the multi-modal inputs for the LLM. Code address: https://github.com/PKU-YuanGroup/Video-LLaVA
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Video-LLaVA, an LVLM that aligns image and video features into a shared language feature space prior to projection into the LLM. This unified representation enables joint training on mixed image-video datasets, yielding mutual performance gains. The model reports state-of-the-art results on 9 image benchmarks (across 5 QA datasets and 4 toolkits) and outperforms Video-ChatGPT by 5.8–18.6% on four video datasets (MSRVTT, MSVD, TGIF, ActivityNet).
Significance. If the empirical link between pre-projection alignment and the observed mutual enhancement holds, the work supplies a simple, reproducible baseline for unified LVLMs. The public code release strengthens the contribution by enabling direct verification of the mixed-training protocol and benchmark numbers.
major comments (3)
- [§3] §3 (Method): The alignment-before-projection step is described at a high level, but the manuscript does not specify whether the alignment loss is applied to frozen or jointly optimized encoders, nor the exact form of the alignment objective (contrastive, reconstruction, etc.). Without this, the causal contribution of the alignment step to the reported gains cannot be isolated from the mixed-dataset training itself.
- [§4.2] §4.2 (Ablation studies): No ablation table isolates the effect of pre-projection alignment versus post-projection fusion or separate image/video projectors. The central claim that alignment enables mutual enhancement therefore rests on the headline benchmark numbers alone rather than controlled comparisons.
- [Table 2] Table 2 (video results): The 5.8–18.6% gains over Video-ChatGPT are reported without standard deviations or multiple-run statistics; given that Video-ChatGPT itself uses a different projector and training schedule, it is unclear whether the margin is attributable to the unified representation or to other hyper-parameter differences.
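The third comment asks for multiple-run statistics. A minimal sketch of what such a report could compute is given below, using only the Python standard library. The function names and the scores are hypothetical placeholders chosen for easy arithmetic; they are not results from the paper or from Video-ChatGPT.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of one model's per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(scores), stdev(scores)


def gap_in_std_units(scores_a, scores_b):
    """Express the mean gap between two models in units of the larger spread."""
    mean_a, std_a = summarize_runs(scores_a)
    mean_b, std_b = summarize_runs(scores_b)
    gap = mean_a - mean_b
    spread = max(std_a, std_b)
    if spread == 0:
        return 0.0 if gap == 0 else float("inf")
    return gap / spread


# Placeholder values only: three hypothetical seeds per model.
model_a_runs = [50.0, 52.0, 54.0]
model_b_runs = [44.0, 46.0, 48.0]

print(summarize_runs(model_a_runs))
print(summarize_runs(model_b_runs))
print(gap_in_std_units(model_a_runs, model_b_runs))
```

A gap that is several times the larger standard deviation is harder to attribute to seed noise, although it still would not separate the unified representation from the other hyper-parameter differences the comment mentions.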
minor comments (2)
- [Abstract] The abstract states '9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits'; the exact mapping between these counts and the tables in §4.1 should be clarified for reproducibility.
- [§3.1] Notation for the unified visual token space (e.g., the symbol used for the aligned feature before the LLM projector) is introduced inconsistently between §3.1 and Figure 2.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the recommendation for minor revision. We address each major point below and will update the manuscript accordingly.
Point-by-point responses
-
Referee: [§3] §3 (Method): The alignment-before-projection step is described at a high level, but the manuscript does not specify whether the alignment loss is applied to frozen or jointly optimized encoders, nor the exact form of the alignment objective (contrastive, reconstruction, etc.). Without this, the causal contribution of the alignment step to the reported gains cannot be isolated from the mixed-dataset training itself.
Authors: We thank the referee for highlighting this omission. The alignment is performed with a contrastive loss between the visual features and language embeddings while jointly optimizing the encoders; the encoders are not frozen. We will revise Section 3 to include the precise loss formulation, optimization schedule, and training details so that the contribution of the alignment step can be more clearly isolated. revision: yes
-
Referee: [§4.2] §4.2 (Ablation studies): No ablation table isolates the effect of pre-projection alignment versus post-projection fusion or separate image/video projectors. The central claim that alignment enables mutual enhancement therefore rests on the headline benchmark numbers alone rather than controlled comparisons.
Authors: We agree that a controlled ablation would strengthen the central claim. In the revised manuscript we will add an ablation study in Section 4.2 that directly compares the pre-projection unified alignment against (i) post-projection fusion and (ii) separate image/video projectors while keeping all other factors fixed. revision: yes
-
Referee: [Table 2] Table 2 (video results): The 5.8–18.6% gains over Video-ChatGPT are reported without standard deviations or multiple-run statistics; given that Video-ChatGPT itself uses a different projector and training schedule, it is unclear whether the margin is attributable to the unified representation or to other hyper-parameter differences.
Authors: We acknowledge that variance statistics would be preferable. Due to the high computational cost of LVLM training we report single-run results, which is standard practice in the field. We will add a clarifying note in the revised paper stating this limitation and pointing out that the observed gains are consistent across four distinct video benchmarks and are accompanied by mutual improvements on image tasks, supporting attribution to the unified representation rather than hyper-parameter differences alone. revision: partial
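The first response above mentions a contrastive loss between visual features and language embeddings but does not give its form. For orientation only, one widely used form is the symmetric InfoNCE objective shown below. The symbols are assumptions introduced here: $v_i$ and $t_i$ are the visual and text embeddings of the $i$-th pair in a batch of size $N$, $\mathrm{sim}$ is cosine similarity, and $\tau$ is a temperature. Whether the paper uses this exact objective is not stated on this page.

```latex
\mathcal{L}_{v \to t} = -\frac{1}{N} \sum_{i=1}^{N}
  \log \frac{\exp\!\big(\mathrm{sim}(v_i, t_i)/\tau\big)}
            {\sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(v_i, t_j)/\tau\big)},
\qquad
\mathcal{L}_{t \to v} = -\frac{1}{N} \sum_{i=1}^{N}
  \log \frac{\exp\!\big(\mathrm{sim}(t_i, v_i)/\tau\big)}
            {\sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(t_i, v_j)/\tau\big)},
\qquad
\mathcal{L}_{\mathrm{align}} = \tfrac{1}{2}\big(\mathcal{L}_{v \to t} + \mathcal{L}_{t \to v}\big)
```

Under this form, each matched visual-text pair is pulled together while the other $N-1$ pairings in the batch act as negatives.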
Axiom & Free-Parameter Ledger
free parameters (1)
- projection and training hyperparameters
axioms (1)
- domain assumption: Transformer-based LLMs can integrate aligned visual tokens effectively
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 26 Pith papers
-
CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding
CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.
-
WirelessSenseLLM: Zero-Shot Human Activity Understanding by Bridging Wireless Signals and Human Language
WirelessSenseLLM bridges unsegmented Wi-Fi CSI signals to LLMs via a CSI-to-Language Adapter for zero-shot human activity understanding and reasoning.
-
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
-
EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding
EyeCue detects driver cognitive distraction by modeling gaze-visual context interactions in egocentric videos and achieves 74.38% accuracy on the new CogDrive dataset, outperforming 11 baselines.
-
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-laye...
-
Grounding Video Reasoning in Physical Signals
A new benchmark converts video clips into shared grounded event records and tests models across physics, semantic, and control prompts under original, shuffled, ablated, and masked conditions, finding selective robust...
-
Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic
SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.
-
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
-
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
-
MLVU: Benchmarking Multi-task Long Video Understanding
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
-
SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs
SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.
-
WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization
WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.
-
See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection
ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.
-
UniCon: Unified Framework for Efficient Contrastive Alignment via Kernels
UniCon unifies contrastive alignment across encoders and alignment types using kernels to enable exact closed-form updates instead of stochastic optimization.
-
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...
-
ViLL-E: Video LLM Embeddings for Retrieval
ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.
-
CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning
CFMS is a coarse-to-fine framework that uses MLLMs to create a multi-perspective knowledge tuple as a reasoning map for symbolic table operations, yielding competitive accuracy on WikiTQ and TabFact.
-
Spatio-Temporal Grounding of Large Language Models from Perception Streams
FESTS uses Spatial Regular Expressions compiled from queries to generate 27k training tuples that raise a 3B-parameter LLM's frame-level F1 on spatio-temporal video reasoning from 48.5% to 87.5%, matching GPT-4.1 whil...
-
Progressive Video Condensation with MLLM Agent for Long-form Video Understanding
ProVCA progressively condenses long videos via segment localization, snippet selection, and keyframe refinement to achieve SOTA zero-shot accuracies on EgoSchema, NExT-QA, and IntentQA with fewer frames.
-
ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling
ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.
-
ImgEdit: A Unified Image Editing Dataset and Benchmark
ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.
-
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.
-
LLaVA-OneVision: Easy Visual Task Transfer
LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
-
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
-
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.
-
A Survey on Hallucination in Large Vision-Language Models
This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.
Reference graph
Works this paper leans on
-
[1]
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716--23736
work page 2022
-
[3]
Max Bain, Arsha Nagrani, G \"u l Varol, and Andrew Zisserman. 2021. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728--1738
work page 2021
-
[5]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901
work page 2020
-
[6]
David Chen and William B Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pages 190--200
work page 2011
-
[8]
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90\ See https://vicuna. lmsys. org (accessed 14 April 2023)
work page 2023
-
[9]
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. 2023. https://arxiv.org/abs/2305.06500 Instructblip: Towards general-purpose vision-language models with instruction tuning . Preprint, arXiv:2305.06500
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[12]
Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. 2023. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180--15190
work page 2023
-
[14]
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904--6913
work page 2017
-
[15]
Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. 2018. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608--3617
work page 2018
-
[17]
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll \'a r, and Ross Girshick. 2022. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000--16009
work page 2022
-
[19]
Drew A Hudson and Christopher D Manning. 2019. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700--6709
work page 2019
-
[21]
Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. 2017. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2758--2766
work page 2017
-
[23]
Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. Vilt: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning, pages 5583--5594. PMLR
work page 2021
-
[24]
Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh
Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh. 2023. https://arxiv.org/abs/2306.16527 Obelics: An open web-scale filtered dataset of interleaved image-text documents . Preprint, arXiv:2306.16527
-
[27]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888--12900. PMLR
work page 2022
-
[28]
Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. 2021. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34:9694--9705
work page 2021
-
[35]
Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507--2521
work page 2022
-
[39]
OpenAI. 2023. https://arxiv.org/abs/2303.08774 Gpt-4 technical report . Preprint, arXiv:2303.08774
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[40]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730--27744
work page 2022
-
[42]
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556--2565
work page 2018
-
[44]
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317--8326
work page 2019
-
[46]
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Stanford alpaca: An instruction-following llama model
work page 2023
-
[50]
Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288--5296
work page 2016
-
[55]
Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. 2019. Activitynet-qa: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9127--9134
work page 2019
-
[60]
Improved Baselines with Visual Instruction Tuning
Improved Baselines with Visual Instruction Tuning , author=. arXiv preprint arXiv:2310.03744 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[61]
Visual instruction tuning , author=. arXiv preprint arXiv:2304.08485 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[62]
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models , author=. arXiv preprint arXiv:2306.05424 , year=
work page internal anchor Pith review arXiv
-
[63]
VideoChat: Chat-Centric Video Understanding
Videochat: Chat-centric video understanding , author=. arXiv preprint arXiv:2305.06355 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[64]
arXiv preprint arXiv:2306.07207 , year=
Valley: Video Assistant with Large Language model Enhanced abilitY , author=. arXiv preprint arXiv:2306.07207 , year=
-
[65]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
Frozen in time: A joint video and image encoder for end-to-end retrieval , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
- [66]
-
[67]
Stanford alpaca: An instruction-following llama model , author=
-
[68]
LLaMA: Open and Efficient Foundation Language Models
Llama: Open and efficient foundation language models , author=. arXiv preprint arXiv:2302.13971 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[69]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Llama 2: Open foundation and fine-tuned chat models , author=. arXiv preprint arXiv:2307.09288 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[70]
Vicuna: An open-source chatbot impressing gpt-4 with 90\ author=. See https://vicuna. lmsys. org (accessed 14 April 2023) , year=
work page 2023
-
[71]
Advances in Neural Information Processing Systems , volume=
Training language models to follow instructions with human feedback , author=. Advances in Neural Information Processing Systems , volume=
-
[72]
Advances in neural information processing systems , volume=
Language models are few-shot learners , author=. Advances in neural information processing systems , volume=
-
[73]
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Visual chatgpt: Talking, drawing and editing with visual foundation models , author=. arXiv preprint arXiv:2303.04671 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[74]
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface , author=. arXiv preprint arXiv:2303.17580 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[75]
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
Mm-react: Prompting chatgpt for multimodal reasoning and action , author=. arXiv preprint arXiv:2303.11381 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[76]
Vipergpt: Visual infer- ence via python execution for reasoning
Vipergpt: Visual inference via python execution for reasoning , author=. arXiv preprint arXiv:2303.08128 , year=
-
[77]
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning , author=. 2023 , eprint=
work page 2023
-
[78]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Minigpt-4: Enhancing vision-language understanding with advanced large language models , author=. arXiv preprint arXiv:2304.10592 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[79]
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
mplug-owl: Modularization empowers large language models with multimodality , author=. arXiv preprint arXiv:2304.14178 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[80]
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Video-llama: An instruction-tuned audio-visual language model for video understanding , author=. arXiv preprint arXiv:2306.02858 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[81]
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
Llama-adapter: Efficient fine-tuning of language models with zero-init attention , author=. arXiv preprint arXiv:2303.16199 , year=
-
[82]
Llama-adapter v2: Parameter-efficient vi- sual instruction model
Llama-adapter v2: Parameter-efficient visual instruction model , author=. arXiv preprint arXiv:2304.15010 , year=
-
[83]
arXiv preprint arXiv:2309.03905 , year=
Imagebind-llm: Multi-modality instruction tuning , author=. arXiv preprint arXiv:2309.03905 , year=
-
[84]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Imagebind: One embedding space to bind them all , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[85]
Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning , author=. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
-
[86]
International Conference on Machine Learning , pages=
Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation , author=. International Conference on Machine Learning , pages=. 2022 , organization=
work page 2022
-
[87]
Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
Making the v in vqa matter: Elevating the role of image understanding in visual question answering , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
-
[88]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Gqa: A new dataset for real-world visual reasoning and compositional question answering , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[89]
Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
Vizwiz grand challenge: Answering visual questions from blind people , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
-
[90]
Advances in Neural Information Processing Systems , volume=
Learn to explain: Multimodal reasoning via thought chains for science question answering , author=. Advances in Neural Information Processing Systems , volume=
-
[91]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Towards vqa models that can read , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[92]
Evaluating Object Hallucination in Large Vision-Language Models
Evaluating object hallucination in large vision-language models , author=. arXiv preprint arXiv:2305.10355 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[93]
MMBench: Is Your Multi-modal Model an All-around Player?
MMBench: Is Your Multi-modal Model an All-around Player? , author=. arXiv preprint arXiv:2307.06281 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[94]
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Mm-vet: Evaluating large multimodal models for integrated capabilities , author=. arXiv preprint arXiv:2308.02490 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[95]
MSR-VTT: A large video description dataset for bridging video and language. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
-
[96]
Collecting highly parallel data for paraphrase evaluation. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.
-
[97]
TGIF-QA: Toward spatio-temporal reasoning in visual question answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
-
[98]
ActivityNet-QA: A dataset for understanding complex web videos via question answering. Proceedings of the AAAI Conference on Artificial Intelligence.
-
[99]
PALM: Pre-training an autoencoding & autoregressive language model for context-conditioned generation. arXiv preprint arXiv:2004.07159.
-
[100]
PaLM 2 technical report. arXiv preprint arXiv:2305.10403.
-
[101]
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv preprint arXiv:2211.05100.
-
[102]
Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems.
-
[103]
Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems.
-
[104]
ViLT: Vision-and-language transformer without convolution or region supervision. International Conference on Machine Learning, 2021.
-
[105]
BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.
-
[106]
Otter: A Multi-Modal Model with In-Context Instruction Tuning. arXiv preprint arXiv:2305.03726.
-
[107]
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. arXiv preprint arXiv:2310.01852.
-
[108]
Multimodal-GPT: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790.
-
[109]
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding. arXiv preprint arXiv:2311.08046.
-
[110]
OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents. 2023.
-
[111]
Masked autoencoders are scalable vision learners. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
-
[112]
VQA: Visual question answering. Proceedings of the IEEE International Conference on Computer Vision.
-
[113]
X-LLM: Bootstrapping advanced large language models by treating multi-modalities as foreign languages. arXiv preprint arXiv:2305.04160.
-
[114]
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration. arXiv preprint arXiv:2306.09093.
-
[115]
Ilharco, Gabriel and Wortsman, Mitchell and Wightman, Ross and Gordon, Cade and Carlini, Nicholas and Taori, Rohan and Dave, Achal and Shankar, Vaishaal and Namkoong, Hongseok and Miller, John and Hajishirzi, Hannaneh and Farhadi, Ali and Schmidt, Ludwig. doi:10.5281/zenodo.5143773