mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Anwen Hu; Chenliang Li; Fei Huang; Guohai Xu; Haiyang Xu; Hehong Chen; Jiabo Ye; Jingren Zhou; Ji Zhang; Junfeng Tian

arXiv: 2304.14178 · v3 · pith:AT6IIIBKnew · submitted 2023-04-27 · 💻 cs.CL · cs.CV · cs.LG

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou

Pith reviewed 2026-05-24 09:00 UTC · model grok-4.3

classification 💻 cs.CL · cs.CV · cs.LG

keywords multimodal large language models · modular training · visual instruction tuning · LoRA adaptation · image-text alignment · multi-turn conversation · knowledge reasoning

The pith

mPLUG-Owl equips large language models with multimodal abilities by training separate visual knowledge and abstractor modules while keeping the core LLM mostly frozen.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a modular training method that adds image understanding to existing large language models. It splits the system into a foundation LLM, a visual knowledge module, and a visual abstractor, then aligns them in two stages. The first stage trains the visual parts with the LLM frozen; the second stage applies low-rank adaptation to the LLM and abstractor on mixed language and multimodal data. This setup is shown to support instruction following, multi-turn dialogue, and knowledge reasoning while also producing unexpected skills such as relating multiple images or reading scene text.

Core claim

A two-stage modular procedure adds visual capabilities to LLMs without degrading their original language generation performance and yields stronger results than prior multimodal models on instruction and reasoning tasks. Stage one freezes the LLM while training the visual modules to align images with text; stage two jointly tunes a LoRA module on the LLM together with the abstractor.

What carries the argument

The two-stage modular training procedure that freezes the LLM in stage one and applies LoRA adaptation to the LLM and abstractor in stage two.
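
The schedule can be sketched as a small lookup of which parameter groups receive gradients in each stage. This is a minimal illustration, not the authors' code: the group names (`visual_knowledge`, `visual_abstractor`, `llm_base`, `llm_lora`) and the helper `trainable_flags` are hypothetical labels chosen here, and the assignment follows the abstract's description (visual modules trained with the LLM frozen, then LoRA plus abstractor trained with the visual knowledge module frozen).

```python
# Hypothetical parameter-group names; the released code may organize these differently.
MODULES = ("visual_knowledge", "visual_abstractor", "llm_base", "llm_lora")

# Groups that are trained in each stage, following the abstract's description.
STAGE_TRAINABLE = {
    1: {"visual_knowledge", "visual_abstractor"},  # LLM frozen, no LoRA training yet
    2: {"visual_abstractor", "llm_lora"},          # visual knowledge module frozen
}


def trainable_flags(stage: int) -> dict:
    """Return a requires-grad style flag for every parameter group in a stage."""
    if stage not in STAGE_TRAINABLE:
        raise ValueError(f"unknown stage: {stage}")
    active = STAGE_TRAINABLE[stage]
    return {name: name in active for name in MODULES}


for stage in (1, 2):
    print(stage, trainable_flags(stage))
```

In a real training loop these flags would typically be applied to each group's parameters before building the optimizer; note that `llm_base` is never trainable in either stage.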

If this is right

The model supports multiple modalities through collaboration between the visual and language modules.
It demonstrates multi-turn conversation and knowledge reasoning abilities on visually related instructions.
Unexpected capabilities emerge, including multi-image correlation and scene text understanding.
These abilities open the possibility of vision-only document comprehension in real scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same modular split could be tested on non-visual modalities such as audio or video by swapping the knowledge module.
If the visual abstractor generalizes, it might reduce the need for full retraining when new image encoders become available.
The two-stage process might serve as a template for adding capabilities to other frozen foundation models beyond vision.

Load-bearing premise

The assumption that freezing the LLM during visual alignment and later using low-rank adaptation will successfully add image understanding without harming the model's language abilities.
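
One reason the premise is plausible at the start of stage two can be sketched under the standard LoRA formulation, which is an assumption here since this page does not say how the released code initializes its adapters. The adapted layer computes `W x + (alpha / rank) * B (A x)` with `W` frozen and `B` initialized to zero, so before any training step it reproduces the base layer exactly. Whether language ability survives after training remains the empirical question. The class `LoRALinear` and helper `matvec` are hypothetical names, written in plain Python for illustration only.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen base weight, never updated
        # A gets a small random init, B starts at zero (standard LoRA init).
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]


W = [[1.0, 2.0], [3.0, 4.0]]
layer = LoRALinear(W, rank=1)
x = [1.0, 1.0]
# With B at zero the adapter contributes nothing, so both outputs match.
print(layer.forward(x), matvec(W, x))
```

Only `A` and `B` would be updated during fine-tuning, which is why the trainable parameter count stays small relative to the base model.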

What would settle it

A direct comparison showing that the trained model scores lower than the original, unmodified base LLM on standard language-only benchmarks such as MMLU or GSM8K.
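
Such a comparison reduces to a per-benchmark difference between the base and tuned models. The sketch below is a hypothetical helper, not part of the paper or its code: `language_regressions` and its `tolerance` parameter are names invented here, and the score dictionaries are expected to be supplied by whoever runs the benchmarks.

```python
def language_regressions(base_scores, tuned_scores, tolerance=0.0):
    """Return {benchmark: delta} where the tuned model trails the base model.

    base_scores / tuned_scores map benchmark names to scores on the same scale.
    A benchmark is reported only if (tuned - base) is below -tolerance.
    Benchmarks missing from either dictionary are ignored.
    """
    shared = sorted(set(base_scores) & set(tuned_scores))
    deltas = {name: tuned_scores[name] - base_scores[name] for name in shared}
    return {name: delta for name, delta in deltas.items() if delta < -tolerance}
```

An empty result would be consistent with the preservation claim on the benchmarks tested; a non-empty one would count against it. A nonzero `tolerance` leaves room for run-to-run noise.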

Figures

Figures reproduced from arXiv: 2304.14178 by Anwen Hu, Chenliang Li, Fei Huang, Guohai Xu, Haiyang Xu, Hehong Chen, Jiabo Ye, Jingren Zhou, Ji Zhang, Junfeng Tian, Junyang Wang, Ming Yan, Pengcheng Shi, Qinghao Ye, Qi Qian, Yaya Shi, Yiyang Zhou, Yuanhong Xu.

**Figure 2.** Our training paradigm and model overview.

**Figure 3.** The comparison between mPLUG-Owl and baselines on OwlEval with manual evaluation

**Figure 4.** The comparison results of 50 single-turn responses (left) and 52 multi-turn responses

**Figure 5.** A comparison of Knowledge-intensive QA

**Figure 6.** A comparison of Multi-turn Conversation.

**Figure 7.** A comparison of Reasoning QA

**Figure 8.** A comparison of Joke Understanding.

**Figure 9.** More cases of Jokes Comprehension by mPLUG-Owl.

**Figure 10.** Multi-image correlation cases.

**Figure 11.** Example prompt of multilingual understanding which showcases the multilingual abili

**Figure 12.** Examples about various document understanding and application.

**Figure 13.** Open-ended creation cases.

**Figure 14.** Copywriting cases.

**Figure 15.** The comparison results which exclude the cases that were generated unsuccessfully by

**Figure 16.** OCR of simple scenes (mostly scenes with few numbers and no calculation a).

**Figure 17.** OCR of complex scenes (a).

**Figure 18.** OCR of complex scenes (b).

Original abstract

Large language models (LLMs) have demonstrated impressive zero-shot abilities on a variety of open-ended tasks, while recent research has also explored the use of LLMs for multi-modal generation. In this study, we introduce mPLUG-Owl, a novel training paradigm that equips LLMs with multi-modal abilities through modularized learning of foundation LLM, a visual knowledge module, and a visual abstractor module. This approach can support multiple modalities and facilitate diverse unimodal and multimodal abilities through modality collaboration. The training paradigm of mPLUG-Owl involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of LLM while maintaining and even improving the generation abilities of LLM. In the first stage, the visual knowledge module and abstractor module are trained with a frozen LLM module to align the image and text. In the second stage, language-only and multi-modal supervised datasets are used to jointly fine-tune a low-rank adaption (LoRA) module on LLM and the abstractor module by freezing the visual knowledge module. We carefully build a visually-related instruction evaluation set OwlEval. Experimental results show that our model outperforms existing multi-modal models, demonstrating mPLUG-Owl's impressive instruction and visual understanding ability, multi-turn conversation ability, and knowledge reasoning ability. Besides, we observe some unexpected and exciting abilities such as multi-image correlation and scene text understanding, which makes it possible to leverage it for harder real scenarios, such as vision-only document comprehension. Our code, pre-trained model, instruction-tuned models, and evaluation set are available at https://github.com/X-PLUG/mPLUG-Owl. The online demo is available at https://www.modelscope.cn/studios/damo/mPLUG-Owl.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

mPLUG-Owl gives a modular recipe for adding vision to LLMs via separate modules and staged training, with code and eval set released, though the performance claims rest on details not shown in the abstract.

The main thing to know is that mPLUG-Owl introduces a modular setup with a visual knowledge module and an abstractor to add vision capabilities to LLMs through two-stage training that freezes the base model first then applies LoRA. They do a solid job releasing the full code, pretrained models, instruction-tuned versions, and their OwlEval set on GitHub. That level of openness is useful.

The training procedure is straightforward: align images and text with the LLM frozen, then jointly fine-tune with language and multimodal data while keeping the visual module fixed. This design aims to add new abilities without breaking existing language performance, and they mention some extra skills like multi-image correlation that emerged.

The soft spots are mostly around the evidence. The abstract claims outperformance on instruction following, visual understanding, and reasoning but gives no numbers, no list of baselines, and no dataset details. Without those, it's impossible to tell if the gains are meaningful or if the method really preserves the original LLM strengths. The evaluation set is new, but its construction and difficulty aren't described enough to judge.

This kind of paper is for people actively working on multimodal LLMs who need practical recipes for adding modalities. It is not a theoretical breakthrough but could serve as a reference implementation. I would send it to peer review so the experiments can be properly checked and the claims tested against the full data.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces mPLUG-Owl, a modular multimodal LLM that augments a foundation language model with a visual knowledge module and a visual abstractor. Training proceeds in two stages: stage 1 freezes the LLM and trains the visual modules on image-text alignment data; stage 2 freezes the visual knowledge module and jointly fine-tunes a LoRA adapter on the LLM together with the abstractor using both language-only and multimodal instruction data. The authors release code, pretrained and instruction-tuned models, and a new visually-oriented instruction benchmark (OwlEval). They report that mPLUG-Owl outperforms prior multimodal models on instruction following, visual understanding, multi-turn conversation, and knowledge reasoning, and exhibits emergent behaviors such as multi-image correlation and scene-text understanding.

Significance. If the reported gains are reproducible, the work supplies a practical, modular recipe for extending LLMs to vision while preserving language-generation quality. The public release of the full training pipeline, model weights, and evaluation set constitutes a concrete contribution that enables direct verification and extension by the community.

major comments (2)

[§4.1, Table 2] §4.1 and Table 2: the claim that mPLUG-Owl outperforms existing multimodal models is load-bearing for the central contribution, yet the manuscript provides no ablation that isolates the contribution of the modular two-stage schedule versus simply using the same instruction data with a non-modular baseline; without this comparison the attribution of gains to modularization remains untested.
[§3.2] §3.2: the assertion that the second-stage LoRA adaptation “maintains and even improves” the original LLM’s generation abilities is central to the modularization thesis, but the paper reports no zero-shot or few-shot language-only benchmarks (e.g., MMLU, BBH) comparing the final model against the unmodified base LLM; this omission leaves the preservation claim unsupported by direct evidence.

minor comments (3)

[Abstract] The abstract states performance improvements without any numeric values or baseline names; moving at least the headline numbers and the most important baseline into the abstract would improve readability.
[§4.1] OwlEval is introduced as a new evaluation set, yet the manuscript does not report inter-annotator agreement, dataset size, or construction protocol; these details belong in §4.1 or an appendix.
[§3] Notation for the visual abstractor and the LoRA modules is introduced without a consolidated table of symbols; adding such a table would aid readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the evidence for our claims.

Referee: [§4.1, Table 2] §4.1 and Table 2: the claim that mPLUG-Owl outperforms existing multimodal models is load-bearing for the central contribution, yet the manuscript provides no ablation that isolates the contribution of the modular two-stage schedule versus simply using the same instruction data with a non-modular baseline; without this comparison the attribution of gains to modularization remains untested.

Authors: We agree that an explicit ablation comparing the two-stage modular schedule against a non-modular baseline trained on the same instruction data would strengthen attribution of gains to modularization. In the revised manuscript we will add this comparison to isolate the contribution of our training paradigm.

revision: yes

Referee: [§3.2] §3.2: the assertion that the second-stage LoRA adaptation “maintains and even improves” the original LLM’s generation abilities is central to the modularization thesis, but the paper reports no zero-shot or few-shot language-only benchmarks (e.g., MMLU, BBH) comparing the final model against the unmodified base LLM; this omission leaves the preservation claim unsupported by direct evidence.

Authors: We acknowledge that direct zero-shot and few-shot results on language-only benchmarks such as MMLU and BBH would provide stronger support for the preservation claim. We will add these evaluations comparing the final model to the base LLM in the revision.

revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

The paper presents an empirical modular training procedure consisting of two independent stages (frozen-LLM visual alignment followed by LoRA fine-tuning on external instruction data) whose success is measured against external benchmarks and an author-constructed evaluation set. No equations, self-definitional mappings, fitted-input predictions, or load-bearing self-citations appear in the abstract or method description that would reduce any claimed capability to a quantity defined inside the paper itself. The derivation chain is therefore self-contained against external data and evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no identifiable free parameters, axioms, or invented entities; no equations or modeling choices are detailed.

pith-pipeline@v0.9.0 · 5918 in / 1072 out tokens · 27435 ms · 2026-05-24T09:00:23.632595+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
cs.CL 2024-09 accept novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
cs.CV 2024-08 conditional novelty 8.0

MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
cs.CL 2023-11 unverdicted novelty 8.0

MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
AffectVerse: Emotional World Models for Multimodal Affective Computing
cs.CV 2026-05 unverdicted novelty 7.0

AffectVerse improves multimodal emotion recognition by at least 2.57% on nine benchmarks through an Emotion World Module that performs short-horizon latent affective prediction via cross-modal temporal imagination and...
AffectGPT-RL: Revealing Roles of Reinforcement Learning in Open-Vocabulary Emotion Recognition
cs.HC 2026-05 unverdicted novelty 7.0

AffectGPT-RL applies reinforcement learning to optimize non-differentiable emotion wheel metrics in open-vocabulary multimodal emotion recognition, yielding performance gains and state-of-the-art results on basic emot...
ICU-Bench: Benchmarking Continual Unlearning in Multimodal Large Language Models
cs.AI 2026-05 unverdicted novelty 7.0

ICU-Bench is a new continual unlearning benchmark for MLLMs using 1000 privacy profiles, 9500 images, and 100 forget tasks, showing existing methods fail to balance forgetting, utility, and scalability.
Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding
cs.CV 2026-04 unverdicted novelty 7.0

DualComp uses a lightweight router to split visual token compression into a semantic stream with size-adaptive clustering and a geometric stream with path-tracing recovery, enabling low-cost high-fidelity UHR remote s...
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
cs.CV 2026-03 unverdicted novelty 7.0

SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks
cs.CV 2024-12 unverdicted novelty 7.0

HumanVBench provides a 16-task benchmark for human-centric video understanding in MLLMs, created through automated annotation and distractor synthesis pipelines, and shows top models lag human performance on emotion p...
VidHal: Benchmarking Temporal Hallucinations in Vision LLMs
cs.CV 2024-11 unverdicted novelty 7.0

VidHal is a new benchmark that evaluates VLLM temporal hallucinations through a caption ordering task on videos with varying hallucination levels.
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark
cs.AI 2024-10 unverdicted novelty 7.0

PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
cs.AI 2024-07 accept novelty 7.0

WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
MLVU: Benchmarking Multi-task Long Video Understanding
cs.CV 2024-06 conditional novelty 7.0

MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
cs.CV 2023-10 unverdicted novelty 7.0

HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
cs.CL 2023-07 unverdicted novelty 7.0

SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
Evaluating Object Hallucination in Large Vision-Language Models
cs.CV 2023-05 accept novelty 7.0

Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
cs.CV 2023-03 conditional novelty 7.0

LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement
cs.CV 2026-05 unverdicted novelty 6.0

A new attention-enhancement method using ARS scores and RVE reduces action-relation hallucinations in LVLMs while generalizing to spatial and object hallucinations.
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
cs.CV 2026-05 unverdicted novelty 6.0

LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and ...
ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning
cs.CV 2026-05 unverdicted novelty 6.0

ReasonEdit uses a new CoT dataset and reinforcement learning to produce interpretable, human-aligned evaluations of text-guided image edits.
Latent Denoising Improves Visual Alignment in Large Multimodal Models
cs.CV 2026-04 unverdicted novelty 6.0

A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.
SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models
cs.CV 2026-04 unverdicted novelty 6.0

SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.
R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs
cs.CV 2026-04 conditional novelty 6.0

R-CoV is a six-step region-aware chain-of-verification technique that elicits coordinate and description outputs from LVLMs themselves to detect and reduce object hallucinations without external models or retraining.
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
cs.CV 2026-04 unverdicted novelty 6.0

CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM
cs.CV 2026-03 unverdicted novelty 6.0

Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five...
Multimodal Reinforcement Learning with Adaptive Verifier for AI Agents
cs.AI 2025-12 unverdicted novelty 6.0

Argos is an agentic verifier that adaptively picks scoring functions to evaluate accuracy, localization, and reasoning quality, enabling stronger multimodal RL training for AI agents.
ORCA: An Agentic Reasoning Framework for Hallucination and Adversarial Robustness in Vision-Language Models
cs.CV 2025-09 unverdicted novelty 6.0

ORCA is an inference-time agentic framework that boosts LVLM accuracy on hallucination benchmarks by 3.64-40.67% and adds adversarial robustness via cross-model validation with small vision tools.
ORCA: An Agentic Reasoning Framework for Hallucination and Adversarial Robustness in Vision-Language Models
cs.CV 2025-09 unverdicted novelty 6.0

ORCA is an agentic reasoning framework that enhances factual accuracy and adversarial robustness of pretrained LVLMs via an Observe-Reason-Critique-Act loop with small vision models, reporting accuracy gains of up to ...
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
cs.CL 2025-03 unverdicted novelty 6.0

A two-stage RL framework first boosts text reasoning in 3B LMMs then adapts it to multimodal inputs, producing modest benchmark gains of 4.5-4.8%.
When Large Vision-Language Models Meet Person Re-Identification
cs.CV 2024-11 unverdicted novelty 6.0

LVLM-ReID guides LVLMs to produce refined semantic tokens as pedestrian identity features for ReID, achieving competitive benchmark results without additional image-text data.
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
cs.CV 2024-10 unverdicted novelty 6.0

LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal de...
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
cs.CV 2024-10 accept novelty 6.0

SparseVLM uses text-guided attention to prune and recycle visual tokens in VLMs, delivering 54% FLOPs reduction and 37% lower latency with 97% accuracy retention on LLaVA.
VideoPhy: Evaluating Physical Commonsense for Video Generation
cs.CV 2024-06 conditional novelty 6.0

VideoPhy benchmark shows state-of-the-art text-to-video models follow physical commonsense and text prompts in only 39.6% of cases for the best model.
Are We on the Right Way for Evaluating Large Vision-Language Models?
cs.CV 2024-03 conditional novelty 6.0

Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
cs.CV 2024-03 unverdicted novelty 6.0

MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
TempCompass: Do Video LLMs Really Understand Videos?
cs.CV 2024-03 unverdicted novelty 6.0

TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
cs.CL 2024-01 conditional novelty 6.0

Mobile-Agent is a vision-centric autonomous agent that uses MLLMs to perceive UI elements, plan complex multi-step tasks, and operate mobile apps without relying on XML or system metadata, showing strong results on th...
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
cs.CV 2024-01 conditional novelty 6.0

MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
cs.HC 2024-01 unverdicted novelty 6.0

SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
cs.CV 2023-11 accept novelty 6.0

MVBench is a benchmark of 20 temporal video understanding tasks built by transforming static tasks into dynamic ones, with VideoChat2 outperforming prior MLLMs by over 15%.
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
cs.CV 2023-11 conditional novelty 6.0

A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
An Embodied Generalist Agent in 3D World
cs.CV 2023-11 unverdicted novelty 6.0

LEO is an embodied generalist agent that performs 3D captioning, question answering, reasoning, navigation, and manipulation after 3D vision-language alignment followed by vision-language-action instruction tuning on ...
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
cs.CV 2023-11 unverdicted novelty 6.0

Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
cs.CV 2023-10 unverdicted novelty 6.0

LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.
Analyzing and Mitigating Object Hallucination in Large Vision-Language Models
cs.LG 2023-10 conditional novelty 6.0

LURE reduces object hallucination in LVLMs by 23% via post-hoc revision informed by co-occurrence, uncertainty, and text position analysis.
Aligning Large Multimodal Models with Factually Augmented RLHF
cs.CV 2023-09 conditional novelty 6.0

Factually Augmented RLHF aligns large multimodal models to reduce hallucinations, reaching 94% of GPT-4 on LLaVA-Bench and 60% improvement on the new MMHAL-BENCH.
MMBench: Is Your Multi-modal Model an All-around Player?
cs.CV 2023-07 accept novelty 6.0

MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
cs.CV 2023-06 accept novelty 6.0

A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.
Revisit What You See: Revealing Visual Semantics in Vision Tokens to Guide LVLM Decoding
cs.CV 2025-06 unverdicted novelty 5.0

ReVisiT refines LVLM output distributions during decoding by projecting selected vision tokens into text space via context-aware constrained divergence minimization.
Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration
cs.CV 2025-05 unverdicted novelty 5.0

CAAC mitigates hallucinations in LVLMs via Visual-Token Calibration and Adaptive Attention Re-Scaling guided by model confidence, showing gains on CHAIR, AMBER, and POPE especially in long-form generation.
Q-Agent: Quality-Driven Chain-of-Thought Image Restoration Agent through Robust Multimodal Large Language Model
eess.IV 2025-04 unverdicted novelty 5.0

Q-Agent uses CoT decomposition on a fine-tuned MLLM for multi-degradation perception plus IQA-driven greedy selection of restoration algorithms to claim better performance than All-in-One IR models.
LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding
cs.CV 2025-01 unverdicted novelty 5.0

LLaVA-Octopus introduces instruction-driven adaptive fusion of multiple visual projectors in a multimodal LLM to improve video understanding performance.
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
cs.CV 2024-08 unverdicted novelty 5.0

mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.
Hallucination of Multimodal Large Language Models: A Survey
cs.CV 2024-04 accept novelty 5.0

The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
cs.LG 2024-02 unverdicted novelty 5.0

POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.
MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices
cs.CV 2023-12 unverdicted novelty 5.0

MobileVLM achieves on-par performance with much larger vision-language models on standard benchmarks while delivering state-of-the-art inference speeds of 21.5 tokens per second on Snapdragon 888 CPU and 65.3 on Jetso...
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
cs.CV 2023-11 unverdicted novelty 5.0

SPHINX improves multi-modal LLMs through joint mixing of weights, tasks, and visual embeddings from varied sources to achieve stronger alignment and multi-purpose capabilities.
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
cs.CL 2023-11 unverdicted novelty 5.0

mPLUG-Owl2 presents a modular MLLM architecture that enables modality collaboration via shared functional modules and modality-adaptive components, achieving SOTA on both text and multi-modal tasks with one generic model.
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
cs.CV 2023-10 unverdicted novelty 5.0

MiniGPT-v2 adds unique task identifiers to a large language model so one system can perform image description, visual question answering, and visual grounding after three-stage training.
Multilingual OCR-Aware Fine-Tuning and Prompt-Guided Chain-of-Thought Reasoning for Multimodal Large Language Models
cs.CV 2026-05 unverdicted novelty 4.0

An OCR-aware multilingual framework combining synthetic data generation, LoRA SFT, and visual CoT prompting improves text extraction and translation robustness in multimodal LLMs on degraded images.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 71 Pith papers · 15 internal anchors

[1]

Flamingo: a Visual Language Model for Few-Shot Learning

J. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski, R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan. Flamingo: a visual language model f...

[2]

X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. CoRR, abs/1504.00325,

[3]

PaLM: Scaling Language Modeling with Pathways

A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev,...

[4]

H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, S. Narang, G. Mishra, A. Yu, V. Y. Zhao, Y. Huang, A. M. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, and J. Wei. Scaling instruction-finetuned langu...

[5]

PaLM-E: An Embodied Multimodal Language Model

D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y. Chebotar, P. Sermanet, D. Duckworth, S. Levine, V. Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence. PaLM-E: An embodied multimodal language model. CoRR, abs/2303.03378,

[6]

J. Li, D. Li, S. Savarese, and S. C. H. Hoi. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. CoRR, abs/2301.12597,

[7]

H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. CoRR, abs/2304.08485,

[8]

GPT-4 Technical Report

OpenAI. GPT-4 technical report. CoRR, abs/2303.08774,

[9]

Training language models to follow instructions with human feedback

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. CoRR, abs/2203.02155,

[10]

T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilic, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, J. Tow, A. M. Rush, S. Biderman, A. Webson, P. S. Ammanamanchi, T. Wang, B. Sagot, N. Muennighoff, A. V. del Moral, O. Ruwase, R. Bawden, S. Bekman, A. McMillan-Major, I. Beltagy, H. Nguyen, L. Saulnier, S. Tan, P. O. Suarez, V. Sanh, H. Lauren...

[11]

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

C. Schuhmann, R. Vencu, R. Beaumont, R. Kaczmarczyk, C. Mullis, A. Katta, T. Coombes, J. Jitsev, and A. Komatsuzaki. LAION-400M: open dataset of CLIP-filtered 400 million image-text pairs. CoRR, abs/2111.02114,

[12]

Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in HuggingFace. CoRR, abs/2303.17580,

[13]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971,

[14]

Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi. Self-instruct: Aligning language model with self generated instructions. CoRR, abs/2212.10560,

[15]

Self-Instruct: Aligning Language Models with Self-Generated Instructions

doi: 10.48550/arXiv.2212.10560. URL https://doi.org/10.48550/arXiv.2212.10560.

C. Wu, S. Yin, W. Qi, X. Wang, Z. Tang, and N. Duan. Visual ChatGPT: Talking, drawing and editing with visual foundation models. CoRR, abs/2303.04671,

[16]

C. Xu, D. Guo, N. Duan, and J. J. McAuley. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. CoRR, abs/2304.01196, 2023a.

H. Xu, M. Yan, C. Li, B. Bi, S. Huang, W. Xiao, and F. Huang. E2E-VLP: end-to-end vision-language pre-training enhanced by visual learning. In ACL/IJCNLP (1), pages 503–513. Association for Computa...

[17]

H. Xu, Q. Ye, M. Yan, Y. Shi, J. Ye, Y. Xu, C. Li, B. Bi, Q. Qian, W. Wang, G. Xu, J. Zhang, S. Huang, F. Huang, and J. Zhou. mPLUG-2: A modularized multi-modal foundation model across text, image and video. CoRR, abs/2302.00402, 2023b.

Z. Yang, L. Li, J. Wang, K. Lin, E. Azarnasab, F. Ahmed, Z. Liu, C. Liu, M. Zeng, and L. Wang. MM-REACT: prompting ...

[18]

Q. Ye, G. Xu, M. Yan, H. Xu, Q. Qian, J. Zhang, and F. Huang. HiTeA: Hierarchical temporal-aware video-language pre-training. CoRR, abs/2212.14546,

[19]

doi: 10.48550/arXiv.2212.14546. URL https://doi.org/10.48550/arXiv.2212.14546.

S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. T. Diab, X. Li, X. V. Lin, T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, P. S. Koura, A. Sridhar, T. Wang, and L. Zettlemoyer. OPT: open pre-trained transformer language models. CoRR, abs/2205.01068,

[20]

D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models, 2023a.

W. Zhu, J. Hessel, A. Awadalla, S. Y. Gadre, J. Dodge, A. Fang, Y. Yu, L. Schmidt, W. Y. Wang, and Y. Choi. Multimodal C4: an open, billion-scale corpus of images interleaved with text. CoRR, abs/2304.0693...

