mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Pith reviewed 2026-05-24 09:00 UTC · model grok-4.3
The pith
mPLUG-Owl equips large language models with multimodal abilities by training separate visual knowledge and abstractor modules while keeping the core LLM mostly frozen.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A two-stage modular procedure adds visual capabilities to LLMs without degrading their original language generation performance and yields stronger results than prior multimodal models on instruction and reasoning tasks. Stage one freezes the LLM while training the visual modules to align images with text; stage two freezes the visual knowledge module and jointly tunes the abstractor and a LoRA module on the LLM.
What carries the argument
The two-stage modular training procedure: stage one freezes the LLM and trains the visual knowledge and abstractor modules; stage two freezes the visual knowledge module and jointly fine-tunes the abstractor and a LoRA adapter on the LLM.
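The schedule can be sketched as a small lookup of which modules receive gradient updates in each stage. This is a minimal illustration in plain Python, not the authors' training code. The module labels (`visual_knowledge`, `visual_abstractor`, `llm_base`, `llm_lora`) and the helper `trainable_modules` are hypothetical names chosen for this sketch, and keeping the base LLM weights frozen in stage two is an assumption that follows the usual LoRA convention.

```python
# Hypothetical module labels; the released code may name these differently.
MODULES = ("visual_knowledge", "visual_abstractor", "llm_base", "llm_lora")

# True = receives gradient updates, False = frozen (or not yet attached).
STAGE_SCHEDULE = {
    # Stage 1: align images with text while the LLM stays frozen.
    1: {"visual_knowledge": True, "visual_abstractor": True,
        "llm_base": False, "llm_lora": False},
    # Stage 2: freeze the visual knowledge module; tune the abstractor and
    # a LoRA adapter. Base LLM weights are assumed to stay frozen.
    2: {"visual_knowledge": False, "visual_abstractor": True,
        "llm_base": False, "llm_lora": True},
}


def trainable_modules(stage: int) -> list[str]:
    """Return the modules that are updated in the given stage."""
    schedule = STAGE_SCHEDULE[stage]
    return [name for name in MODULES if schedule[name]]


if __name__ == "__main__":
    for stage in (1, 2):
        print(f"stage {stage}: {trainable_modules(stage)}")
```

Under these assumptions the abstractor is the only module trained in both stages, and the base LLM weights are never updated directly.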
If this is right
- The model supports multiple modalities through collaboration between the visual and language modules.
- It demonstrates multi-turn conversation and knowledge reasoning abilities on visually related instructions.
- Unexpected capabilities emerge, including multi-image correlation and scene text understanding.
- These abilities open the possibility of vision-only document comprehension in real scenarios.
Where Pith is reading between the lines
- The same modular split could be tested on other modalities such as audio or video by swapping the knowledge module.
- If the visual abstractor generalizes, it might reduce the need for full retraining when new image encoders become available.
- The two-stage process might serve as a template for adding capabilities to other frozen foundation models beyond vision.
Load-bearing premise
The assumption that freezing the LLM during visual alignment and later using low-rank adaptation will successfully add image understanding without harming the model's language abilities.
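One reason the premise is plausible at the start of stage two can be shown with the standard LoRA formulation, h = W0 x + (alpha / r) B A x, where B is initialized to zero so the adapter initially contributes nothing. The sketch below uses plain Python lists; the function names, matrix sizes, and the values of `alpha` and `rank` are illustrative assumptions, and whether mPLUG-Owl uses exactly this initialization is not stated in the text above. It says nothing about what happens after training, which is the part the premise actually depends on.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(w0, a, b, x, alpha, rank):
    """Compute h = W0 x + (alpha / rank) * B (A x); W0 is never modified."""
    base = matvec(w0, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [h + scale * u for h, u in zip(base, update)]


def init_lora(d_out, d_in, rank, seed=0):
    """A gets small random values, B starts at zero (the usual LoRA init)."""
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    b = [[0.0] * rank for _ in range(d_out)]
    return a, b


if __name__ == "__main__":
    w0 = [[1.0, 2.0, 3.0], [0.5, -1.0, 0.0]]  # frozen base weight (2 x 3)
    x = [1.0, 0.0, -1.0]
    a, b = init_lora(d_out=2, d_in=3, rank=2)
    # With B = 0 the adapter adds nothing, so the layer matches the base layer.
    print(lora_forward(w0, a, b, x, alpha=4.0, rank=2) == matvec(w0, x))
```

Once B moves away from zero during fine-tuning, the outputs diverge from the base layer, so preservation of language ability has to be measured rather than assumed.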
What would settle it
A direct comparison showing that the trained model scores lower than the unmodified base LLM on standard language-only benchmarks such as MMLU or GSM8K.
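Such a comparison could be organized as a simple regression check over per-benchmark accuracies. The sketch below is a hypothetical helper, not part of the paper or its code release; the function name `language_regressions`, the `tolerance` parameter, and the scores in the usage example are placeholders for illustration and are not results from the paper.

```python
def language_regressions(base_scores, tuned_scores, tolerance=0.0):
    """Return benchmarks where the tuned model falls below the base model.

    Both arguments map benchmark name -> accuracy in [0, 1].
    `tolerance` is the size of drop that is still treated as noise.
    """
    regressions = {}
    for name, base in base_scores.items():
        if name not in tuned_scores:
            raise KeyError(f"missing tuned score for {name}")
        drop = base - tuned_scores[name]
        if drop > tolerance:
            regressions[name] = drop
    return regressions


if __name__ == "__main__":
    # Placeholder numbers for illustration only; NOT results from the paper.
    base = {"MMLU": 0.50, "GSM8K": 0.25}
    tuned = {"MMLU": 0.50, "GSM8K": 0.125}
    print(language_regressions(base, tuned))
```

An empty result would be consistent with the preservation claim; any non-empty result on a well-powered benchmark would count against it.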
Original abstract
Large language models (LLMs) have demonstrated impressive zero-shot abilities on a variety of open-ended tasks, while recent research has also explored the use of LLMs for multi-modal generation. In this study, we introduce mPLUG-Owl, a novel training paradigm that equips LLMs with multi-modal abilities through modularized learning of foundation LLM, a visual knowledge module, and a visual abstractor module. This approach can support multiple modalities and facilitate diverse unimodal and multimodal abilities through modality collaboration. The training paradigm of mPLUG-Owl involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of LLM while maintaining and even improving the generation abilities of LLM. In the first stage, the visual knowledge module and abstractor module are trained with a frozen LLM module to align the image and text. In the second stage, language-only and multi-modal supervised datasets are used to jointly fine-tune a low-rank adaption (LoRA) module on LLM and the abstractor module by freezing the visual knowledge module. We carefully build a visually-related instruction evaluation set OwlEval. Experimental results show that our model outperforms existing multi-modal models, demonstrating mPLUG-Owl's impressive instruction and visual understanding ability, multi-turn conversation ability, and knowledge reasoning ability. Besides, we observe some unexpected and exciting abilities such as multi-image correlation and scene text understanding, which makes it possible to leverage it for harder real scenarios, such as vision-only document comprehension. Our code, pre-trained model, instruction-tuned models, and evaluation set are available at https://github.com/X-PLUG/mPLUG-Owl. The online demo is available at https://www.modelscope.cn/studios/damo/mPLUG-Owl.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces mPLUG-Owl, a modular multimodal LLM that augments a foundation language model with a visual knowledge module and a visual abstractor. Training proceeds in two stages: stage 1 freezes the LLM and trains the visual modules on image-text alignment data; stage 2 freezes the visual knowledge module and jointly fine-tunes a LoRA adapter on the LLM together with the abstractor using both language-only and multimodal instruction data. The authors release code, pretrained and instruction-tuned models, and a new visually-oriented instruction benchmark (OwlEval). They report that mPLUG-Owl outperforms prior multimodal models on instruction following, visual understanding, multi-turn conversation, and knowledge reasoning, and exhibits emergent behaviors such as multi-image correlation and scene-text understanding.
Significance. If the reported gains are reproducible, the work supplies a practical, modular recipe for extending LLMs to vision while preserving language-generation quality. The public release of the full training pipeline, model weights, and evaluation set constitutes a concrete contribution that enables direct verification and extension by the community.
major comments (2)
- [§4.1, Table 2] §4.1 and Table 2: the claim that mPLUG-Owl outperforms existing multimodal models is load-bearing for the central contribution, yet the manuscript provides no ablation that isolates the contribution of the modular two-stage schedule versus simply using the same instruction data with a non-modular baseline; without this comparison the attribution of gains to modularization remains untested.
- [§3.2] §3.2: the assertion that the second-stage LoRA adaptation “maintains and even improves” the original LLM’s generation abilities is central to the modularization thesis, but the paper reports no zero-shot or few-shot language-only benchmarks (e.g., MMLU, BBH) comparing the final model against the unmodified base LLM; this omission leaves the preservation claim unsupported by direct evidence.
minor comments (3)
- [Abstract] The abstract states performance improvements without any numeric values or baseline names; moving at least the headline numbers and the most important baseline into the abstract would improve readability.
- [§4.1] OwlEval is introduced as a new evaluation set, yet the manuscript does not report inter-annotator agreement, dataset size, or construction protocol; these details belong in §4.1 or an appendix.
- [§3] Notation for the visual abstractor and the LoRA modules is introduced without a consolidated table of symbols; adding such a table would aid readers.
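The inter-annotator agreement requested for OwlEval is commonly reported as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a generic implementation for two annotators; the label values in the usage example are hypothetical rating grades, and nothing here reflects how OwlEval was actually annotated.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical grades from two annotators on four responses.
    annotator_1 = ["A", "A", "B", "B"]
    annotator_2 = ["A", "B", "B", "B"]
    print(cohens_kappa(annotator_1, annotator_2))
```

For more than two annotators a multi-rater statistic such as Fleiss' kappa would be the usual substitute.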
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the evidence for our claims.
-
Referee: [§4.1, Table 2] §4.1 and Table 2: the claim that mPLUG-Owl outperforms existing multimodal models is load-bearing for the central contribution, yet the manuscript provides no ablation that isolates the contribution of the modular two-stage schedule versus simply using the same instruction data with a non-modular baseline; without this comparison the attribution of gains to modularization remains untested.
Authors: We agree that an explicit ablation comparing the two-stage modular schedule against a non-modular baseline trained on the same instruction data would strengthen attribution of gains to modularization. In the revised manuscript we will add this comparison to isolate the contribution of our training paradigm. revision: yes
-
Referee: [§3.2] §3.2: the assertion that the second-stage LoRA adaptation “maintains and even improves” the original LLM’s generation abilities is central to the modularization thesis, but the paper reports no zero-shot or few-shot language-only benchmarks (e.g., MMLU, BBH) comparing the final model against the unmodified base LLM; this omission leaves the preservation claim unsupported by direct evidence.
Authors: We acknowledge that direct zero-shot and few-shot results on language-only benchmarks such as MMLU and BBH would provide stronger support for the preservation claim. We will add these evaluations comparing the final model to the base LLM in the revision. revision: yes
Circularity Check
No significant circularity identified
The paper presents an empirical modular training procedure consisting of two independent stages (frozen-LLM visual alignment followed by LoRA fine-tuning on external instruction data) whose success is measured against external benchmarks and an author-constructed evaluation set. No equations, self-definitional mappings, fitted-input predictions, or load-bearing self-citations appear in the abstract or method description that would reduce any claimed capability to a quantity defined inside the paper itself. The derivation chain is therefore self-contained against external data and evaluation.
Forward citations
Cited by 60 Pith papers
-
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding model scores 16.8% to 26.9% lower than on MMMU.
-
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
-
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
-
AffectVerse: Emotional World Models for Multimodal Affective Computing
AffectVerse improves multimodal emotion recognition by at least 2.57% on nine benchmarks through an Emotion World Module that performs short-horizon latent affective prediction via cross-modal temporal imagination and...
-
AffectGPT-RL: Revealing Roles of Reinforcement Learning in Open-Vocabulary Emotion Recognition
AffectGPT-RL applies reinforcement learning to optimize non-differentiable emotion wheel metrics in open-vocabulary multimodal emotion recognition, yielding performance gains and state-of-the-art results on basic emot...
-
ICU-Bench: Benchmarking Continual Unlearning in Multimodal Large Language Models
ICU-Bench is a new continual unlearning benchmark for MLLMs using 1000 privacy profiles, 9500 images, and 100 forget tasks, showing existing methods fail to balance forgetting, utility, and scalability.
-
Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding
DualComp uses a lightweight router to split visual token compression into a semantic stream with size-adaptive clustering and a geometric stream with path-tracing recovery, enabling low-cost high-fidelity UHR remote s...
-
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
-
HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks
HumanVBench provides a 16-task benchmark for human-centric video understanding in MLLMs, created through automated annotation and distractor synthesis pipelines, and shows top models lag human performance on emotion p...
-
VidHal: Benchmarking Temporal Hallucinations in Vision LLMs
VidHal is a new benchmark that evaluates VLLM temporal hallucinations through a caption ordering task on videos with varying hallucination levels.
-
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark
PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.
-
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
-
MLVU: Benchmarking Multi-task Long Video Understanding
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
-
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.
-
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
-
Evaluating Object Hallucination in Large Vision-Language Models
Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.
-
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
-
Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement
A new attention-enhancement method using ARS scores and RVE reduces action-relation hallucinations in LVLMs while generalizing to spatial and object hallucinations.
-
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and ...
-
ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning
ReasonEdit uses a new CoT dataset and reinforcement learning to produce interpretable, human-aligned evaluations of text-guided image edits.
-
Latent Denoising Improves Visual Alignment in Large Multimodal Models
A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.
-
SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models
SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.
-
R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs
R-CoV is a six-step region-aware chain-of-verification technique that elicits coordinate and description outputs from LVLMs themselves to detect and reduce object hallucinations without external models or retraining.
-
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
-
Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM
Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five...
-
Multimodal Reinforcement Learning with Adaptive Verifier for AI Agents
Argos is an agentic verifier that adaptively picks scoring functions to evaluate accuracy, localization, and reasoning quality, enabling stronger multimodal RL training for AI agents.
-
ORCA: An Agentic Reasoning Framework for Hallucination and Adversarial Robustness in Vision-Language Models
ORCA is an inference-time agentic framework that boosts LVLM accuracy on hallucination benchmarks by 3.64-40.67% and adds adversarial robustness via cross-model validation with small vision tools.
-
ORCA: An Agentic Reasoning Framework for Hallucination and Adversarial Robustness in Vision-Language Models
ORCA is an agentic reasoning framework that enhances factual accuracy and adversarial robustness of pretrained LVLMs via an Observe-Reason-Critique-Act loop with small vision models, reporting accuracy gains of up to ...
-
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
A two-stage RL framework first boosts text reasoning in 3B LMMs then adapts it to multimodal inputs, producing modest benchmark gains of 4.5-4.8%.
-
When Large Vision-Language Models Meet Person Re-Identification
LVLM-ReID guides LVLMs to produce refined semantic tokens as pedestrian identity features for ReID, achieving competitive benchmark results without additional image-text data.
-
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal de...
-
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
SparseVLM uses text-guided attention to prune and recycle visual tokens in VLMs, delivering 54% FLOPs reduction and 37% lower latency with 97% accuracy retention on LLaVA.
-
VideoPhy: Evaluating Physical Commonsense for Video Generation
VideoPhy benchmark shows state-of-the-art text-to-video models follow physical commonsense and text prompts in only 39.6% of cases for the best model.
-
Are We on the Right Way for Evaluating Large Vision-Language Models?
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
-
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
-
TempCompass: Do Video LLMs Really Understand Videos?
TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.
-
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
Mobile-Agent is a vision-centric autonomous agent that uses MLLMs to perceive UI elements, plan complex multi-step tasks, and operate mobile apps without relying on XML or system metadata, showing strong results on th...
-
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.
-
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.
-
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
MVBench is a benchmark of 20 temporal video understanding tasks built by transforming static tasks into dynamic ones, with VideoChat2 outperforming prior MLLMs by over 15%.
-
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
-
An Embodied Generalist Agent in 3D World
LEO is an embodied generalist agent that performs 3D captioning, question answering, reasoning, navigation, and manipulation after 3D vision-language alignment followed by vision-language-action instruction tuning on ...
-
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
-
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.
-
Analyzing and Mitigating Object Hallucination in Large Vision-Language Models
LURE reduces object hallucination in LVLMs by 23% via post-hoc revision informed by co-occurrence, uncertainty, and text position analysis.
-
Aligning Large Multimodal Models with Factually Augmented RLHF
Factually Augmented RLHF aligns large multimodal models to reduce hallucinations, reaching 94% of GPT-4 on LLaVA-Bench and 60% improvement on the new MMHAL-BENCH.
-
MMBench: Is Your Multi-modal Model an All-around Player?
MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.
-
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.
-
Revisit What You See: Revealing Visual Semantics in Vision Tokens to Guide LVLM Decoding
ReVisiT refines LVLM output distributions during decoding by projecting selected vision tokens into text space via context-aware constrained divergence minimization.
-
Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration
CAAC mitigates hallucinations in LVLMs via Visual-Token Calibration and Adaptive Attention Re-Scaling guided by model confidence, showing gains on CHAIR, AMBER, and POPE especially in long-form generation.
-
Q-Agent: Quality-Driven Chain-of-Thought Image Restoration Agent through Robust Multimodal Large Language Model
Q-Agent uses CoT decomposition on a fine-tuned MLLM for multi-degradation perception plus IQA-driven greedy selection of restoration algorithms to claim better performance than All-in-One IR models.
-
LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding
LLaVA-Octopus introduces instruction-driven adaptive fusion of multiple visual projectors in a multimodal LLM to improve video understanding performance.
-
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.
-
Hallucination of Multimodal Large Language Models: A Survey
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
-
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.
-
MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices
MobileVLM achieves on-par performance with much larger vision-language models on standard benchmarks while delivering state-of-the-art inference speeds of 21.5 tokens per second on Snapdragon 888 CPU and 65.3 on Jetso...
-
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
SPHINX improves multi-modal LLMs through joint mixing of weights, tasks, and visual embeddings from varied sources to achieve stronger alignment and multi-purpose capabilities.
-
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
mPLUG-Owl2 presents a modular MLLM architecture that enables modality collaboration via shared functional modules and modality-adaptive components, achieving SOTA on both text and multi-modal tasks with one generic model.
-
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
MiniGPT-v2 adds unique task identifiers to a large language model so one system can perform image description, visual question answering, and visual grounding after three-stage training.
-
Multilingual OCR-Aware Fine-Tuning and Prompt-Guided Chain-of-Thought Reasoning for Multimodal Large Language Models
An OCR-aware multilingual framework combining synthetic data generation, LoRA SFT, and visual CoT prompting improves text extraction and translation robustness in multimodal LLMs on degraded images.