MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Pith reviewed 2026-05-10 20:20 UTC · model grok-4.3
The pith
A new benchmark evaluates multimodal large language models on 14 perception and cognition subtasks using hand-designed questions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a benchmark built from 14 subtasks can measure both perception and cognition abilities in multimodal large language models, that manually designed instruction-answer pairs prevent data leakage while keeping comparisons fair, and that evaluations of 30 existing models demonstrate substantial remaining gaps along with concrete directions for improvement.
What carries the argument
The MME benchmark, which consists of 14 subtasks split between perception and cognition, each using concise manually crafted instruction-answer pairs that support direct scoring without prompt engineering.
If this is right
- Models can be ranked on specific perception and cognition skills without the results depending on how prompts are worded.
- Weaknesses in particular subtasks become visible so optimization can target those gaps directly.
- Quantitative scores across many models become possible, revealing patterns that case studies alone do not show.
- Future model releases can be checked against the same fixed set of tasks for consistent progress tracking.
Where Pith is reading between the lines
- The benchmark could become a standard test set that new models are required to report on before publication.
- Training pipelines might incorporate the 14 subtasks as additional supervision signals to close the observed gaps.
- Similar hand-designed evaluation sets could be created for other multimodal domains such as video or audio.
Load-bearing premise
The hand-designed instruction-answer pairs are sufficient to block data leakage from existing public datasets and the short instructions produce fair comparisons across models without any prompt tuning.
What would settle it
A model achieving significantly higher scores on the same subtasks when given different or longer instructions, or evidence that the test pairs appear in the training data of evaluated models.
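The second condition, overlap between test pairs and training data, can be probed at the surface level with an n-gram comparison. The sketch below is illustrative only: the function names `ngrams` and `overlap_ratio`, the trigram default, and the tokenization rule are assumptions for this example, not anything specified by the paper. A low ratio would not rule out leakage of paraphrased or image-level content.

```python
import re
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams in text (lowercased, punctuation dropped)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(test_item: str, corpus: Iterable[str], n: int = 3) -> float:
    """Fraction of the test item's n-grams that also occur anywhere in the corpus.

    Hypothetical helper: 0.0 means no shared n-grams, 1.0 means every
    n-gram of the test item was found in the reference corpus.
    """
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```

Running `overlap_ratio` over every benchmark instruction against a public caption or VQA corpus would flag items with high ratios for manual inspection; the threshold for "high" is a judgment call that this sketch does not settle.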
Original abstract
Multimodal Large Language Model (MLLM) relies on the powerful LLM to perform multimodal tasks, showing amazing emergent abilities in recent studies, such as writing poems based on an image. However, it is difficult for these case studies to fully reflect the performance of MLLM, lacking a comprehensive evaluation. In this paper, we fill in this blank, presenting the first comprehensive MLLM Evaluation benchmark MME. It measures both perception and cognition abilities on a total of 14 subtasks. In order to avoid data leakage that may arise from direct use of public datasets for evaluation, the annotations of instruction-answer pairs are all manually designed. The concise instruction design allows us to fairly compare MLLMs, instead of struggling in prompt engineering. Besides, with such an instruction, we can also easily carry out quantitative statistics. A total of 30 advanced MLLMs are comprehensively evaluated on our MME, which not only suggests that existing MLLMs still have a large room for improvement, but also reveals the potential directions for the subsequent model optimization. The data are released at the project page https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MME, the first comprehensive benchmark for Multimodal Large Language Models (MLLMs), comprising 14 subtasks that separately assess perception and cognition abilities. To mitigate data leakage, all instruction-answer pairs are manually designed rather than drawn from public datasets; concise, fixed instructions are used to enable direct, prompt-engineering-free comparisons across models. The authors evaluate 30 advanced MLLMs on the benchmark and conclude that substantial headroom remains for improvement in both perception and cognition.
Significance. If the no-leakage and instruction-invariance properties can be demonstrated, MME would supply a much-needed standardized yardstick for MLLM progress, analogous to GLUE or ImageNet in their respective domains. The public release of the data and the separation of perception versus cognition subtasks are concrete strengths that would allow the community to track targeted improvements.
major comments (3)
- §3 (Benchmark Construction): The claim that manually designed instruction-answer pairs eliminate data leakage is unsupported by any reported overlap audit, n-gram analysis, or membership inference check against the training corpora of the 30 evaluated MLLMs (e.g., LAION-5B, COCO, or VQAv2). Because every quantitative result rests on the assumption that the test pairs are unseen, this omission is load-bearing for the central validity claim.
- §4.2 (Model Evaluation): No ablation is presented that varies instruction phrasing while holding the underlying image-question pairs fixed. Without such evidence, the assertion that the chosen concise instructions remove prompt-engineering variance cannot be verified, directly affecting the fairness of the cross-model ranking.
- §3.2 (Annotation Process): Inter-annotator agreement statistics (e.g., Cohen’s κ or percentage agreement) are not reported for the manually created answer labels across the 14 subtasks. This is required to establish that the ground-truth answers are reliable rather than idiosyncratic to the annotators.
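For the agreement statistic named in the last comment, Cohen's κ compares the observed agreement between two annotators with the agreement expected by chance from their label frequencies: κ = (p_o − p_e) / (1 − p_e). The following is a minimal sketch; the function name `cohens_kappa` and the handling of the degenerate case are choices made for this example, and the paper reports no such computation.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[lab] * count_b[lab] for lab in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # formally undefined, and this sketch returns 1.0 by convention.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)
```

Computed per subtask, a κ near 1 would support the claim that the answers are unambiguous, while a low value on a particular subtask would point to questions that need rewording.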
minor comments (3)
- Table 1: The column headers for perception versus cognition subtasks would be clearer if an explicit grouping line or background shading were added.
- §5 (Discussion): A few citations to contemporaneous MLLM evaluation efforts (e.g., recent works on LLaVA or InstructBLIP) appear to be missing from the related-work section.
- Figure 2: Axis labels on the radar charts are occasionally truncated; ensure all subtask names remain fully legible at print resolution.
Simulated Author's Rebuttal
We thank the referee for the positive summary and for highlighting areas where additional evidence can strengthen the paper. We address each of the major comments in detail below and outline the revisions we plan to make.
-
Referee: §3 (Benchmark Construction): The claim that manually designed instruction-answer pairs eliminate data leakage is unsupported by any reported overlap audit, n-gram analysis, or membership inference check against the training corpora of the 30 evaluated MLLMs (e.g., LAION-5B, COCO, or VQAv2). Because every quantitative result rests on the assumption that the test pairs are unseen, this omission is load-bearing for the central validity claim.
Authors: We agree that providing evidence for the lack of data leakage is important to validate the benchmark. Our instruction-answer pairs were entirely manually crafted by the authors, deliberately avoiding any direct extraction from public datasets to prevent leakage. To address this concern, we will add an n-gram overlap analysis with widely used datasets such as COCO, VQAv2, and others in the revised manuscript. A full membership inference check against the proprietary training data of all 30 MLLMs is not possible due to lack of public access to those corpora; however, the manual design process ensures that the pairs are original and not copied from known sources.
Revision: partial
-
Referee: §4.2 (Model Evaluation): No ablation is presented that varies instruction phrasing while holding the underlying image-question pairs fixed. Without such evidence, the assertion that the chosen concise instructions remove prompt-engineering variance cannot be verified, directly affecting the fairness of the cross-model ranking.
Authors: We thank the referee for this suggestion. While our concise instructions were designed to minimize prompt engineering effects and enable consistent comparisons, we recognize the value of empirical validation. In the revised manuscript, we will include an ablation study where we vary the instruction phrasing for a selection of subtasks and models, demonstrating that the performance rankings remain largely consistent.
Revision: yes
-
Referee: §3.2 (Annotation Process): Inter-annotator agreement statistics (e.g., Cohen’s κ or percentage agreement) are not reported for the manually created answer labels across the 14 subtasks. This is required to establish that the ground-truth answers are reliable rather than idiosyncratic to the annotators.
Authors: We acknowledge the importance of demonstrating label reliability. The annotations were manually designed by the authors with careful consideration to make answers objective and unambiguous. We did not collect formal inter-annotator agreement statistics during the process. In the revision, we will expand the description of the annotation procedure to better convey how subjectivity was minimized.
Revision: partial
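The claim in the second response, that rankings "remain largely consistent" under rephrased instructions, is a statement about rank correlation. One way to quantify it is Kendall's tau between the model scores obtained under two instruction variants. The sketch below implements the simple tau-a form (tied pairs count as neither concordant nor discordant); the function name `kendall_tau` and the dictionary input format are assumptions for illustration, not the authors' planned procedure.

```python
from itertools import combinations
from typing import Dict


def kendall_tau(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> float:
    """Kendall's tau-a between two score tables keyed by model name.

    Only models present in both tables are compared. Returns a value in
    [-1, 1]: 1 for identical orderings, -1 for fully reversed orderings.
    """
    models = sorted(set(scores_a) & set(scores_b))
    if len(models) < 2:
        raise ValueError("need at least two shared models to compare rankings")
    concordant = 0
    discordant = 0
    for m1, m2 in combinations(models, 2):
        # A pair is concordant when both variants order the two models the same way.
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / total_pairs
```

A tau close to 1 across phrasing variants would support the fairness claim, whereas a markedly lower value on some subtasks would show where the ranking depends on wording.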
Circularity Check
No circularity: benchmark is manually constructed and externally evaluated
The paper introduces the MME benchmark by manually designing instruction-answer pairs for 14 subtasks to measure perception and cognition in MLLMs. It then directly evaluates 30 external models on these fixed pairs and reports aggregate scores. No parameters are fitted to the benchmark data, no predictions are generated from the benchmark outputs that loop back to its construction, and no uniqueness theorems or ansatzes are invoked via self-citation. The central claims rest on the external model evaluations and the manual design process itself, which is presented as an independent methodological choice rather than a derived result. This satisfies the criteria for a self-contained benchmark paper with no load-bearing circular steps.
Forward citations
Cited by 60 Pith papers
-
SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, with evaluations showing direct QA at 66.4%, best practical agents at 79.1%, and oracle knowledge at 95.4%.
-
SpikeMLLM: Spike-based Multimodal Large Language Models via Modality-Specific Temporal Scales and Temporal Compression
SpikeMLLM is the first spike-based MLLM framework that maintains near-lossless performance under aggressive timestep compression and delivers 9x throughput and 25x power efficiency gains via a custom RTL accelerator.
-
HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing
HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
-
MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
MVI-Bench supplies the first taxonomy and dataset focused on misleading visual inputs to measure LVLM robustness, with tests on 18 models revealing clear weaknesses.
-
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
-
VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis
VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from...
-
JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation
JMed48k is a new large-scale benchmark of Japanese medical licensing exams with images that reveals proprietary VLMs benefit more from visuals than medical-specific models, with large variation across professions.
-
Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.
-
Modality-Decoupled Online Recursive Editing
M-ORE decouples text and visual update statistics in MLLMs and applies recursive low-rank edits in an orthogonal subspace to reduce cross-modal conflict and long-horizon interference.
-
SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practica...
-
SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?
SaaS-Bench provides 106 realistic professional tasks across 23 deployable SaaS platforms to evaluate LLM-based agents, finding that even the strongest models complete fewer than 4% of tasks end-to-end.
-
OxyEcomBench: Benchmarking Multimodal Foundation Models across E-Commerce Ecosystems
OxyEcomBench is a unified multimodal benchmark covering 6 capability areas and 29 tasks with authentic e-commerce data to measure how well foundation models handle real platform, merchant, and customer challenges.
-
UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs
UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.
-
Allegory of the Cave: Measurement-Grounded Vision-Language Learning
PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.
-
Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability
A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.
-
Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models
XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...
-
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-laye...
-
Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision
EgoPoint-Bench reveals that MLLMs suffer from referential hallucination on egocentric pointing and shows that fine-tuning on its synthetic data produces measurable gains with sim-to-real transfer.
-
DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models
DO-Bench is a controlled benchmark that attributes VLM object hallucination errors to textual prior pressure, perceptual limits, or their interaction via two diagnostic dimensions and metrics.
-
S-GRPO: Unified Post-Training for Large Vision-Language Models
S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.
-
DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing
DSCA turns concept isolation into an architectural property by dynamically creating orthogonal subspaces for non-interfering lifelong edits in vision-language models, sustaining over 95% success after 1000 sequential edits.
-
ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference
ID-Selection combines importance scoring with iterative diversity suppression to prune 97.2% of visual tokens in LVLMs while retaining 91.8% performance and cutting FLOPs by over 97% without retraining.
-
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
-
Mema: Memory-Augmented Adapter for Enhanced Vision-Language Understanding
Mema adds a stateful memory module to vision encoders that accumulates hierarchical visual features across layers and selectively injects portions back via feedback to preserve fine-grained cues, yielding consistent g...
-
Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning
Omni-R1 unifies multimodal reasoning by generating intermediate images during the process in a SFT-plus-RL framework, with an Omni-R1-Zero variant that matches or exceeds it using only text data.
-
AIA: Rethinking Architecture Decoupling Strategy In Unified Multimodal Model
AIA loss teaches unified multimodal models task-specific cross-modal attention patterns to reduce conflicts between image understanding and generation without architecture decoupling.
-
CaptionQA: Is Your Caption as Useful as the Image Itself?
CaptionQA is a new benchmark with 33,027 questions across natural, document, e-commerce, and embodied AI domains that measures how much utility model-generated captions retain compared to original images when used by ...
-
XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
XModBench is a tri-modal benchmark that systematically measures cross-modal consistency, modality disparities, and directional imbalances in omni-language models across five task families and all modality combinations.
-
Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching
Derives exact guidance transition rates for discrete flow matching models that require only one model evaluation per sampling step and unify prior approximation-based methods.
-
MMSearch-R1: Incentivizing LMMs to Search
MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting searc...
-
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperfor...
-
Transfer between Modalities with MetaQueries
MetaQueries act as an efficient bridge allowing multimodal LLMs to augment diffusion-based image generation and editing without complex training or unfreezing the LLM backbone.
-
AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization
AdaMMS merges heterogeneous MLLMs via architecture mapping, linear weight interpolation, and unsupervised hyper-parameter search, outperforming prior methods on vision-language benchmarks as the first such approach wi...
-
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.
-
Unified Reward Model for Multimodal Understanding and Generation
UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.
-
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
-
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
-
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.
-
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark
PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.
-
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
-
MLVU: Benchmarking Multi-task Long Video Understanding
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
-
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.
-
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.
-
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
-
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
-
Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
Uni-Edit frames intelligent image editing as a general task for unified multimodal models and uses an automated pipeline to synthesize complex reasoning-intensive instructions from VQA data, yielding performance gains...
-
Reducing Object Hallucination in LVLMs via Emphasizing Image-negative Tokens
Reweighting training emphasis toward image-negative tokens and filtering hallucinated data reduces object hallucination in LVLMs across three model variants.
-
Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy
ILVAD is a plug-and-play method that builds a saliency map from inter-layer attention discrepancies on early tokens to enhance visual evidence focus and ground generated text, reducing hallucinations in LVLMs.
-
Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models
SPpruner reduces visual tokens in VLMs via focus identification followed by context-aware scanning, retaining 22.2% tokens for 2.53x speedup on Qwen2.5-VL with negligible accuracy loss.
-
Semantic Generative Tuning for Unified Multimodal Models
Semantic Generative Tuning uses image segmentation as a generative proxy to align misaligned representation spaces in unified multimodal models and improve both perception and generative layout fidelity.
-
FinDocMRE: A Benchmark for Document-Level Financial Multimodal Reasoning Evaluation
FinDocMRE is a new multi-image document-level benchmark spanning 12 financial domains and 5 task types, showing that 11 tested LMMs all score below 65 overall with particular weaknesses in numerical estimation and cro...
-
A More Word-like Image Tokenization for MLLMs
DiVT clusters patch embeddings into coherent semantic units and adapts token count to image complexity, matching or exceeding baselines with fewer visual tokens on multimodal benchmarks.
-
LatentUMM: Dual Latent Alignment for Unified Multimodal Models
LatentUMM proposes dual latent alignment at modality and capacity levels plus latent dynamics stabilization to reduce semantic drift and improve consistency in unified multimodal models.
-
UAM: A Dual-Stream Perspective on Forgetting in VLA Training
UAM adds a Dorsal Expert initialized from a generative model and trained on visual dynamics prediction to preserve over 95% of VLM multimodal ability in VLA training while achieving top success rates on manipulation t...
-
SEED: Targeted Data Selection by Weighted Independent Set
SEED models data selection as Weighted Independent Set on a similarity graph, using node value calibration and local scale normalization to produce compact high-quality training subsets that outperform prior methods o...
-
LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs
LRCP prunes visual tokens in LVLMs by scoring projection residuals onto a PCA-estimated low-rank subspace, achieving 88.9% image token reduction with 94.7% performance retention and 87.5% video reduction with 97.8% ac...
-
GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models
GRIP-VLM applies group-relative policy optimization via reinforcement learning to prune visual tokens in VLMs, yielding up to 15% inference speedup at matched accuracy over prior methods.
-
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
-
Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement
A new attention-enhancement method using ARS scores and RVE reduces action-relation hallucinations in LVLMs while generalizing to spatial and object hallucinations.
-
When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs
Layer-wise Laplacian energy of visual attention reveals hallucination emergence in MLLMs and enables LaSCD, a closed-form logit remapping strategy that mitigates hallucinations while preserving general performance.
Reference graph
Works this paper leans on
-
[1]
Infmllm.https://github.com/mightyzau/InfMLLM, 2023
work page 2023
-
[2]
Lion.https://github.com/mynameischaos/Lion, 2023
work page 2023
-
[3]
Octopus.https://github.com/gray311/UnifiedMultimodalInstructionTuning, 2023
work page 2023
-
[4]
Skywork-mm.https://github.com/will-singularity/Skywork-MM, 2023
work page 2023
-
[5]
Visualglm-6b.https://github.com/THUDM/VisualGLM-6B, 2023
work page 2023
-
[6]
Wemm.https://github.com/scenarios/WeMM, 2023
work page 2023
-
[7]
Xcomposer-vl.https://github.com/InternLM/InternLM-XComposer, 2023
work page 2023
-
[8]
Flamingo: a visual language model for few-shot learning.NeurIPS, 2022
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.NeurIPS, 2022. 9
work page 2022
-
[9]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities.arXiv preprint:2308.12966, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[10]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NeurIPS, 2020
work page 2020
-
[11]
Microsoft COCO Captions: Data Collection and Evaluation Server
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server.arXiv preprint:1504.00325, 2015
work page internal anchor Pith review arXiv 2015
-
[12]
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning.arXiv preprint:2305.06500, 2023
work page internal anchor Pith review arXiv 2023
-
[13]
PaLM-E: An Embodied Multimodal Language Model
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model.arXiv preprint:2303.03378, 2023
work page internal anchor Pith review arXiv 2023
-
[14]
Mmbench- video: A long-form multi-shot benchmark for holistic video understanding.NeurIPS, 2024
Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench- video: A long-form multi-shot benchmark for holistic video understanding.NeurIPS, 2024
work page 2024
-
[15]
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Con- ghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model.arXiv preprint:2304.15010, 2023
work page internal anchor Pith review arXiv 2023
-
[16]
Multimodal-gpt: A vision and lan- guage model for dialogue with humans
Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint:2305.04790, 2023
-
[17]
Making the v in vqa matter: Elevating the role of image understanding in visual question answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InCVPR, 2017
work page 2017
-
[18]
Imagebind-llm: Multi-modality instruction tuning
Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, et al. Imagebind-llm: Multi-modality instruction tuning.arXiv preprint:2309.03905, 2023
-
[19]
Bliva: A simple multimodal llm for better handling of text-rich visual questions
Wenbo Hu, Yifan Xu, Y Li, W Li, Z Chen, and Z Tu. Bliva: A simple multimodal llm for better handling of text-rich visual questions.arXiv preprint:2308.09936, 2023
-
[20]
Movienet: A holistic dataset for movie understanding
Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, and Dahua Lin. Movienet: A holistic dataset for movie understanding. InECCV, 2020
work page 2020
-
[21]
Language Is Not All You Need: Aligning Perception with Language Models
Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint:2302.14045, 2023
-
[22]
Mimic-it: Multi-modal in-context instruction tuning
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint:2306.05425, 2023
-
[23]
Otter: A Multi-Modal Model with In-Context Instruction Tuning
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint:2305.03726, 2023
-
[24]
Empowering vision-language models to follow interleaved vision-language instructions
Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Hanwang Zhang, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, and Yueting Zhuang. Fine-tuning multimodal llms to follow zero-shot demonstrative instructions. arXiv preprint:2308.04152, 2023
-
[25]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint:2301.12597, 2023
-
[26]
Evaluating Object Hallucination in Large Vision-Language Models
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint:2305.10355, 2023
-
[27]
Microsoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014
-
[28]
Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint:2311.07575, 2023
-
[29]
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint:2306.14565, 2023
-
[30]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint:2304.08485, 2023
-
[31]
Geometry-guided dense perspective network for speech-driven facial animation
Jingying Liu, Binyuan Hui, Kun Li, Yunke Liu, Yu-Kun Lai, Yuxiang Zhang, Yebin Liu, and Jingyu Yang. Geometry-guided dense perspective network for speech-driven facial animation. IEEE TVCG, 2021
-
[32]
MMBench: Is Your Multi-modal Model an All-around Player?
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint:2307.06281, 2023
-
[33]
Curved scene text detection via transverse and longitudinal sequence connection
Yuliang Liu, Lianwen Jin, Shuaitao Zhang, Canjie Luo, and Sheng Zhang. Curved scene text detection via transverse and longitudinal sequence connection. PR, 2019
-
[34]
Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. NeurIPS, 2022
-
[35]
Cheap and quick: Efficient vision-language instruction tuning for large language models
Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, and Rongrong Ji. Cheap and quick: Efficient vision-language instruction tuning for large language models. arXiv preprint:2305.15023, 2023
-
[36]
Deepart: Learning joint representations of visual arts
Hui Mao, Ming Cheung, and James She. Deepart: Learning joint representations of visual arts. In ICM, 2017
-
[37]
Visual arts search on mobile devices
Hui Mao, James She, and Ming Cheung. Visual arts search on mobile devices. TOMM, 2019
-
[38]
Ok-vqa: A visual question answering benchmark requiring external knowledge
Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In CVPR, 2019
-
[39]
OpenAI. Gpt-4 technical report. arXiv preprint:2303.08774, 2023
-
[40]
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint:2303.17580, 2023
-
[41]
PandaGPT: One Model To Instruction-Follow Them All
Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all. arXiv preprint:2305.16355, 2023
-
[42]
Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025
-
[43]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint:2302.13971, 2023
-
[44]
An overview of large ai models and their applications
Xiaoguang Tu, Zhi He, Yi Huang, Zhi-Hao Zhang, Ming Yang, and Jian Zhao. An overview of large ai models and their applications. Visual Intelligence, 2024
-
[45]
GIT: A Generative Image-to-text Transformer for Vision and Language
Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language. arXiv preprint:2205.14100, 2022
-
[46]
Visionllm: Large language model is also an open-ended decoder for vision-centric tasks
Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint:2305.11175, 2023
-
[47]
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint:2201.11903, 2022
-
[48]
Dycrowd: Towards dynamic crowd reconstruction from a large-scene video
Hao Wen, Hongbo Kang, Jian Ma, Jing Huang, Yuanwang Yang, Haozhe Lin, Yu-Kun Lai, and Kun Li. Dycrowd: Towards dynamic crowd reconstruction from a large-scene video. IEEE TPAMI, 2025
-
[49]
Google landmarks dataset v2-a large-scale benchmark for instance-level recognition and retrieval
Tobias Weyand, Andre Araujo, Bingyi Cao, and Jack Sim. Google landmarks dataset v2-a large-scale benchmark for instance-level recognition and retrieval. In CVPR, 2020
-
[50]
An early evaluation of gpt-4v(ision)
Yang Wu, Shilong Wang, Hao Yang, Tian Zheng, Hongbo Zhang, Yanyan Zhao, and Bing Qin. An early evaluation of gpt-4v(ision). arXiv preprint:2310.16534, 2023
-
[51]
Multiinstruct: Improving multi-modal zero-shot learning via instruction tuning
Zhiyang Xu, Ying Shen, and Lifu Huang. Multiinstruct: Improving multi-modal zero-shot learning via instruction tuning. arXiv preprint:2212.10773, 2022
-
[52]
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. arXiv preprint:2311.04257, 2023
-
[53]
A Survey on Multimodal Large Language Models
Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. arXiv preprint:2306.13549, 2023
-
[54]
Woodpecker: Hallucination correction for multimodal large language models
Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, and Enhong Chen. Woodpecker: Hallucination correction for multimodal large language models. arXiv preprint:2310.16045, 2023
-
[55]
Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, et al. Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi. arXiv preprint arXiv:2404.16006, 2024
-
[56]
Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants
Tianyu Yu, Jinyi Hu, Yuan Yao, Haoye Zhang, Yue Zhao, Chongyi Wang, Shan Wang, Yinxv Pan, Jiao Xue, Dahai Li, et al. Reformulating vision-language foundation models and datasets towards universal multimodal assistants. arXiv preprint:2310.00653, 2023
-
[57]
Yan Zeng, Hanbo Zhang, Jiani Zheng, Jiangnan Xia, Guoqiang Wei, Yang Wei, Yuchen Zhang, and Tao Kong. What matters in training a gpt4-style language model with multimodal inputs? arXiv preprint:2307.02469, 2023
-
[58]
Transfer visual prompt generator across llms
Ao Zhang, Hao Fei, Yuan Yao, Wei Ji, Li Li, Zhiyuan Liu, and Tat-Seng Chua. Transfer visual prompt generator across llms. arXiv preprint:2305.01278, 2023
-
[59]
Logavatar: Local gaussian splatting for human avatar modeling from monocular video
Jinsong Zhang, Xiongzheng Li, Hailong Jia, Jin Li, Zhuo Su, Guidong Wang, and Kun Li. Logavatar: Local gaussian splatting for human avatar modeling from monocular video. CAD, 2025
-
[60]
Speechact: Towards generating whole-body motion from speech
Jinsong Zhang, Minjie Zhu, Yuxiang Zhang, Zerong Zheng, Yebin Liu, and Kun Li. Speechact: Towards generating whole-body motion from speech. IEEE TVCG, 2025
-
[61]
Mmicl: Empowering vision-language model with multi-modal in-context learning
Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, and Baobao Chang. Mmicl: Empowering vision-language model with multi-modal in-context learning. arXiv preprint:2309.07915, 2023
-
[62]
A Survey of Large Language Models
Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint:2303.18223, 2023
-
[63]
On evaluating adversarial robustness of large vision-language models
Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models. arXiv preprint:2305.16934, 2023
-
[64]
Chatbridge: Bridging modalities with large language model as a language catalyst
Zijia Zhao, Longteng Guo, Tongtian Yue, Sihan Chen, Shuai Shao, Xinxin Zhu, Zehuan Yuan, and Jing Liu. Chatbridge: Bridging modalities with large language model as a language catalyst. arXiv preprint:2305.16103, 2023
-
[65]
Learning deep features for scene recognition using places database
Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. NeurIPS, 2014
-
[66]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint:2304.10592, 2023