MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Caifeng Shan; Chaoyou Fu; Jinrui Yang; Ke Li; Mengdan Zhang; Peixian Chen; Ran He; Rongrong Ji; Xiawu Zheng; Xing Sun

arXiv: 2306.13394 · v5 · submitted 2023-06-23 · cs.CV

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, Rongrong Ji, Caifeng Shan, Ran He

Pith reviewed 2026-05-10 20:20 UTC · model grok-4.3

classification cs.CV

keywords multimodal large language models · evaluation benchmark · perception abilities · cognition abilities · instruction-answer pairs · model comparison

The pith

A new benchmark evaluates multimodal large language models on 14 perception and cognition subtasks using hand-designed questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a benchmark to test how well multimodal large language models handle both perception tasks such as recognizing objects in images and cognition tasks such as reasoning from visual input. It creates 14 subtasks with manually written instruction-answer pairs to avoid models simply remembering data from public sources. The short, fixed instructions let different models be compared directly without extra prompt tuning. When 30 current models are run through the benchmark, the results show clear shortfalls in many areas and suggest specific places where future models could be strengthened.

Core claim

The central claim is that a benchmark built from 14 subtasks can measure both perception and cognition abilities in multimodal large language models, that manually designed instruction-answer pairs prevent data leakage while keeping comparisons fair, and that evaluations of 30 existing models demonstrate substantial remaining gaps along with concrete directions for improvement.

What carries the argument

The MME benchmark carries the argument. It consists of 14 subtasks split between perception and cognition, each using concise, manually crafted instruction-answer pairs that support direct scoring without prompt engineering.
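As a rough illustration of what "direct scoring" could mean in practice, the sketch below assumes that every instruction asks for a yes-or-no answer and that a reply is judged by its first word. The summary above does not spell out the answer format, so that assumption, along with the hypothetical names `normalize` and `subtask_accuracy`, is illustrative only and is not taken from any released evaluation script.

```python
import re
from collections import defaultdict


def normalize(reply: str) -> str:
    """Reduce a free-form reply to 'yes', 'no', or 'other' using its first word."""
    words = re.findall(r"[a-z]+", reply.lower())
    first = words[0] if words else ""
    return first if first in ("yes", "no") else "other"


def subtask_accuracy(records):
    """Compute per-subtask accuracy.

    records: iterable of (subtask, model_reply, gold_label) tuples,
    where gold_label is 'yes' or 'no' (assumed answer format).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subtask, reply, gold in records:
        total[subtask] += 1
        correct[subtask] += int(normalize(reply) == gold)
    return {name: correct[name] / total[name] for name in total}
```

Because the instruction is fixed and the expected answer is a single word, no model-specific prompt or parser is needed, which is the property the fairness argument relies on.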

If this is right

Models can be ranked on specific perception and cognition skills without the results depending on how prompts are worded.
Weaknesses in particular subtasks become visible so optimization can target those gaps directly.
Quantitative scores across many models become possible, revealing patterns that case studies alone do not show.
Future model releases can be checked against the same fixed set of tasks for consistent progress tracking.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could become a standard test set that new models are required to report on before publication.
Training pipelines might incorporate the 14 subtasks as additional supervision signals to close the observed gaps.
Similar hand-designed evaluation sets could be created for other multimodal domains such as video or audio.

Load-bearing premise

The hand-designed instruction-answer pairs are sufficient to block data leakage from existing public datasets and the short instructions produce fair comparisons across models without any prompt tuning.

What would settle it

The premise would be undermined by a model achieving significantly higher scores on the same subtasks when given different or longer instructions, or by evidence that the test pairs appear in the training data of evaluated models.

Original abstract

Multimodal Large Language Model (MLLM) relies on the powerful LLM to perform multimodal tasks, showing amazing emergent abilities in recent studies, such as writing poems based on an image. However, it is difficult for these case studies to fully reflect the performance of MLLM, lacking a comprehensive evaluation. In this paper, we fill in this blank, presenting the first comprehensive MLLM Evaluation benchmark MME. It measures both perception and cognition abilities on a total of 14 subtasks. In order to avoid data leakage that may arise from direct use of public datasets for evaluation, the annotations of instruction-answer pairs are all manually designed. The concise instruction design allows us to fairly compare MLLMs, instead of struggling in prompt engineering. Besides, with such an instruction, we can also easily carry out quantitative statistics. A total of 30 advanced MLLMs are comprehensively evaluated on our MME, which not only suggests that existing MLLMs still have a large room for improvement, but also reveals the potential directions for the subsequent model optimization. The data are released at the project page https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces MME, the first comprehensive benchmark for Multimodal Large Language Models (MLLMs), comprising 14 subtasks that separately assess perception and cognition abilities. To mitigate data leakage, all instruction-answer pairs are manually designed rather than drawn from public datasets; concise, fixed instructions are used to enable direct, prompt-engineering-free comparisons across models. The authors evaluate 30 advanced MLLMs on the benchmark and conclude that substantial headroom remains for improvement in both perception and cognition.

Significance. If the no-leakage and instruction-invariance properties can be demonstrated, MME would supply a much-needed standardized yardstick for MLLM progress, analogous to GLUE or ImageNet in their respective domains. The public release of the data and the separation of perception versus cognition subtasks are concrete strengths that would allow the community to track targeted improvements.

major comments (3)

§3 (Benchmark Construction): The claim that manually designed instruction-answer pairs eliminate data leakage is unsupported by any reported overlap audit, n-gram analysis, or membership inference check against the training corpora of the 30 evaluated MLLMs (e.g., LAION-5B, COCO, or VQAv2). Because every quantitative result rests on the assumption that the test pairs are unseen, this omission is load-bearing for the central validity claim.
§4.2 (Model Evaluation): No ablation is presented that varies instruction phrasing while holding the underlying image-question pairs fixed. Without such evidence, the assertion that the chosen concise instructions remove prompt-engineering variance cannot be verified, directly affecting the fairness of the cross-model ranking.
§3.2 (Annotation Process): Inter-annotator agreement statistics (e.g., Cohen’s κ or percentage agreement) are not reported for the manually created answer labels across the 14 subtasks. This is required to establish that the ground-truth answers are reliable rather than idiosyncratic to the annotators.

minor comments (3)

Table 1: The column headers for perception versus cognition subtasks would be clearer if an explicit grouping line or background shading were added.
§5 (Discussion): A few citations to contemporaneous MLLM evaluation efforts (e.g., recent works on LLaVA or InstructBLIP) appear to be missing from the related-work section.
Figure 2: Axis labels on the radar charts are occasionally truncated; ensure all subtask names remain fully legible at print resolution.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the positive summary and for highlighting areas where additional evidence can strengthen the paper. We address each of the major comments in detail below and outline the revisions we plan to make.

Point-by-point responses

Referee: §3 (Benchmark Construction): The claim that manually designed instruction-answer pairs eliminate data leakage is unsupported by any reported overlap audit, n-gram analysis, or membership inference check against the training corpora of the 30 evaluated MLLMs (e.g., LAION-5B, COCO, or VQAv2). Because every quantitative result rests on the assumption that the test pairs are unseen, this omission is load-bearing for the central validity claim.

Authors: We agree that providing evidence for the lack of data leakage is important to validate the benchmark. Our instruction-answer pairs were entirely manually crafted by the authors, deliberately avoiding any direct extraction from public datasets to prevent leakage. To address this concern, we will add an n-gram overlap analysis with widely used datasets such as COCO, VQAv2, and others in the revised manuscript. A full membership inference check against the proprietary training data of all 30 MLLMs is not possible due to lack of public access to those corpora; however, the manual design process ensures that the pairs are original and not copied from known sources. revision: partial
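The response does not describe the promised n-gram overlap analysis any further, so the following is only a minimal sketch of what such a check could look like. It flags benchmark texts that share at least one word-level n-gram with a reference corpus. The choice of 5-grams, the simple tokenizer, and the names `ngrams` and `overlap_rate` are all assumptions made for illustration.

```python
import re


def ngrams(text: str, n: int = 5) -> set:
    """Return the set of word-level n-grams in a lowercased, punctuation-free text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_rate(benchmark_texts, reference_texts, n: int = 5):
    """Return (fraction of benchmark texts sharing an n-gram with the reference, flagged texts)."""
    reference = set()
    for text in reference_texts:
        reference |= ngrams(text, n)
    flagged = [text for text in benchmark_texts if ngrams(text, n) & reference]
    rate = len(flagged) / len(benchmark_texts) if benchmark_texts else 0.0
    return rate, flagged
```

A check of this kind can only show overlap with corpora that are publicly available. It says nothing about proprietary training data, which is the limitation the response itself concedes.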

Referee: §4.2 (Model Evaluation): No ablation is presented that varies instruction phrasing while holding the underlying image-question pairs fixed. Without such evidence, the assertion that the chosen concise instructions remove prompt-engineering variance cannot be verified, directly affecting the fairness of the cross-model ranking.

Authors: We thank the referee for this suggestion. While our concise instructions were designed to minimize prompt engineering effects and enable consistent comparisons, we recognize the value of empirical validation. In the revised manuscript, we will include an ablation study where we vary the instruction phrasing for a selection of subtasks and models, demonstrating that the performance rankings remain largely consistent. revision: yes
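For the proposed ablation, "largely consistent" rankings would need a number attached. The response names no statistic, so as an assumption the sketch below uses Kendall's tau (the tau-a variant) between the model scores obtained under two instruction phrasings. The function name and the dictionary input format are hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_a: dict, scores_b: dict) -> float:
    """Kendall's tau-a between two {model_name: score} mappings.

    Only models present in both mappings are compared. A value of 1.0
    means both phrasings rank the models identically, and -1.0 means
    the ranking is fully reversed. Tied pairs count toward neither side.
    """
    models = sorted(set(scores_a) & set(scores_b))
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        sign = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / pairs if pairs else float("nan")
```

Running this per subtask rather than on aggregate scores would show whether any instability is concentrated in particular perception or cognition tasks.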

Referee: §3.2 (Annotation Process): Inter-annotator agreement statistics (e.g., Cohen’s κ or percentage agreement) are not reported for the manually created answer labels across the 14 subtasks. This is required to establish that the ground-truth answers are reliable rather than idiosyncratic to the annotators.

Authors: We acknowledge the importance of demonstrating label reliability. The annotations were manually designed by the authors with careful consideration to make answers objective and unambiguous. We did not collect formal inter-annotator agreement statistics during the process. In the revision, we will expand the description of the annotation procedure to better convey how subjectivity was minimized. revision: partial
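For reference, Cohen's κ, the statistic the referee names, compares the observed agreement p_o between two annotators with the agreement p_e expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below shows how it could be computed if a second annotator independently relabelled a sample of items. That relabelling setup and the function name are assumptions, since the response states that no agreement data were collected.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two annotators labelling the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; kappa is undefined,
        # and 1.0 is returned here by convention.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Reporting κ alongside raw percentage agreement matters most when one answer dominates a subtask, because chance agreement is then high and the raw percentage overstates reliability.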

Circularity Check

0 steps flagged

No circularity: benchmark is manually constructed and externally evaluated

Full rationale

The paper introduces the MME benchmark by manually designing instruction-answer pairs for 14 subtasks to measure perception and cognition in MLLMs. It then directly evaluates 30 external models on these fixed pairs and reports aggregate scores. No parameters are fitted to the benchmark data, no predictions are generated from the benchmark outputs that loop back to its construction, and no uniqueness theorems or ansatzes are invoked via self-citation. The central claims rest on the external model evaluations and the manual design process itself, which is presented as an independent methodological choice rather than a derived result. This satisfies the criteria for a self-contained benchmark paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper. No free parameters, mathematical axioms, or invented entities are introduced; the central claim rests on the manual annotation process and subtask selection.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
cs.AI 2026-05 accept novelty 8.0

SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, with evaluations showing direct QA at 66.4%, best practical agents at 79.1%, and oracle knowledge at 95.4%.
SpikeMLLM: Spike-based Multimodal Large Language Models via Modality-Specific Temporal Scales and Temporal Compression
cs.NE 2026-04 unverdicted novelty 8.0

SpikeMLLM is the first spike-based MLLM framework that maintains near-lossless performance under aggressive timestep compression and delivers 9x throughput and 25x power efficiency gains via a custom RTL accelerator.
HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing
cs.CV 2026-04 accept novelty 8.0

HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
cs.CV 2025-11 unverdicted novelty 8.0

MVI-Bench supplies the first taxonomy and dataset focused on misleading visual inputs to measure LVLM robustness, with tests on 18 models revealing clear weaknesses.
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
cs.CV 2024-08 conditional novelty 8.0

MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis
cs.CV 2026-05 unverdicted novelty 7.0

VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from...
JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation
cs.CV 2026-05 conditional novelty 7.0

JMed48k is a new large-scale benchmark of Japanese medical licensing exams with images that reveals proprietary VLMs benefit more from visuals than medical-specific models, with large variation across professions.
Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
cs.CV 2026-05 unverdicted novelty 7.0

Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.
Modality-Decoupled Online Recursive Editing
cs.LG 2026-05 conditional novelty 7.0

M-ORE decouples text and visual update statistics in MLLMs and applies recursive low-rank edits in an orthogonal subspace to reduce cross-modal conflict and long-horizon interference.
SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
cs.AI 2026-05 unverdicted novelty 7.0

SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practica...
SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?
cs.AI 2026-05 unverdicted novelty 7.0

SaaS-Bench provides 106 realistic professional tasks across 23 deployable SaaS platforms to evaluate LLM-based agents, finding that even the strongest models complete fewer than 4% of tasks end-to-end.
OxyEcomBench: Benchmarking Multimodal Foundation Models across E-Commerce Ecosystems
cs.DB 2026-05 conditional novelty 7.0

OxyEcomBench is a unified multimodal benchmark covering 6 capability areas and 29 tasks with authentic e-commerce data to measure how well foundation models handle real platform, merchant, and customer challenges.
UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs
cs.CV 2026-05 unverdicted novelty 7.0

UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.
Allegory of the Cave: Measurement-Grounded Vision-Language Learning
cs.AI 2026-05 unverdicted novelty 7.0

PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.
Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability
cs.CL 2026-05 unverdicted novelty 7.0

A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.
Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models
cs.CV 2026-04 unverdicted novelty 7.0

XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
cs.CV 2026-04 unverdicted novelty 7.0

LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-laye...
Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision
cs.CV 2026-04 unverdicted novelty 7.0

EgoPoint-Bench reveals that MLLMs suffer from referential hallucination on egocentric pointing and shows that fine-tuning on its synthetic data produces measurable gains with sim-to-real transfer.
DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models
cs.CV 2026-04 unverdicted novelty 7.0

DO-Bench is a controlled benchmark that attributes VLM object hallucination errors to textual prior pressure, perceptual limits, or their interaction via two diagnostic dimensions and metrics.
S-GRPO: Unified Post-Training for Large Vision-Language Models
cs.LG 2026-04 unverdicted novelty 7.0

S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.
DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing
cs.CV 2026-04 unverdicted novelty 7.0

DSCA turns concept isolation into an architectural property by dynamically creating orthogonal subspaces for non-interfering lifelong edits in vision-language models, sustaining over 95% success after 1000 sequential edits.
ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference
cs.CV 2026-04 unverdicted novelty 7.0

ID-Selection combines importance scoring with iterative diversity suppression to prune 97.2% of visual tokens in LVLMs while retaining 91.8% performance and cutting FLOPs by over 97% without retraining.
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
cs.CV 2026-03 unverdicted novelty 7.0

SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
Mema: Memory-Augmented Adapter for Enhanced Vision-Language Understanding
cs.CV 2026-02 unverdicted novelty 7.0

Mema adds a stateful memory module to vision encoders that accumulates hierarchical visual features across layers and selectively injects portions back via feedback to preserve fine-grained cues, yielding consistent g...
Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning
cs.AI 2026-01 unverdicted novelty 7.0

Omni-R1 unifies multimodal reasoning by generating intermediate images during the process in a SFT-plus-RL framework, with an Omni-R1-Zero variant that matches or exceeds it using only text data.
AIA: Rethinking Architecture Decoupling Strategy In Unified Multimodal Model
cs.CV 2025-11 unverdicted novelty 7.0

AIA loss teaches unified multimodal models task-specific cross-modal attention patterns to reduce conflicts between image understanding and generation without architecture decoupling.
CaptionQA: Is Your Caption as Useful as the Image Itself?
cs.CV 2025-11 conditional novelty 7.0

CaptionQA is a new benchmark with 33,027 questions across natural, document, e-commerce, and embodied AI domains that measures how much utility model-generated captions retain compared to original images when used by ...
XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
cs.CV 2025-10 conditional novelty 7.0

XModBench is a tri-modal benchmark that systematically measures cross-modal consistency, modality disparities, and directional imbalances in omni-language models across five task families and all modality combinations.
Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching
cs.LG 2025-09 conditional novelty 7.0

Derives exact guidance transition rates for discrete flow matching models that require only one model evaluation per sampling step and unify prior approximation-based methods.
MMSearch-R1: Incentivizing LMMs to Search
cs.CV 2025-06 unverdicted novelty 7.0

MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting searc...
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
cs.CV 2025-04 unverdicted novelty 7.0

FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperfor...
Transfer between Modalities with MetaQueries
cs.CV 2025-04 unverdicted novelty 7.0

MetaQueries act as an efficient bridge allowing multimodal LLMs to augment diffusion-based image generation and editing without complex training or unfreezing the LLM backbone.
AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization
cs.CL 2025-03 unverdicted novelty 7.0

AdaMMS merges heterogeneous MLLMs via architecture mapping, linear weight interpolation, and unsupervised hyper-parameter search, outperforming prior methods on vision-language benchmarks as the first such approach wi...
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
cs.AI 2025-03 conditional novelty 7.0

R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.
Unified Reward Model for Multimodal Understanding and Generation
cs.CV 2025-03 unverdicted novelty 7.0

UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
cs.CV 2024-12 accept novelty 7.0

OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
cs.CV 2024-10 accept novelty 7.0

PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
cs.CV 2024-10 unverdicted novelty 7.0

Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark
cs.AI 2024-10 unverdicted novelty 7.0

PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
cs.AI 2024-07 accept novelty 7.0

WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
MLVU: Benchmarking Multi-task Long Video Understanding
cs.CV 2024-06 conditional novelty 7.0

MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
cs.CV 2024-03 conditional novelty 7.0

MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
cs.CV 2023-10 unverdicted novelty 7.0

HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
cs.CL 2023-07 unverdicted novelty 7.0

SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
cs.CV 2023-03 conditional novelty 7.0

LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
cs.CV 2026-05 unverdicted novelty 6.0

Uni-Edit frames intelligent image editing as a general task for unified multimodal models and uses an automated pipeline to synthesize complex reasoning-intensive instructions from VQA data, yielding performance gains...
Reducing Object Hallucination in LVLMs via Emphasizing Image-negative Tokens
cs.CV 2026-05 unverdicted novelty 6.0

Reweighting training emphasis toward image-negative tokens and filtering hallucinated data reduces object hallucination in LVLMs across three model variants.
Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy
cs.CV 2026-05 unverdicted novelty 6.0

ILVAD is a plug-and-play method that builds a saliency map from inter-layer attention discrepancies on early tokens to enhance visual evidence focus and ground generated text, reducing hallucinations in LVLMs.
Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models
cs.CV 2026-05 conditional novelty 6.0

SPpruner reduces visual tokens in VLMs via focus identification followed by context-aware scanning, retaining 22.2% tokens for 2.53x speedup on Qwen2.5-VL with negligible accuracy loss.
Semantic Generative Tuning for Unified Multimodal Models
cs.CV 2026-05 unverdicted novelty 6.0

Semantic Generative Tuning uses image segmentation as a generative proxy to align misaligned representation spaces in unified multimodal models and improve both perception and generative layout fidelity.
FinDocMRE: A Benchmark for Document-Level Financial Multimodal Reasoning Evaluation
cs.CE 2026-05 unverdicted novelty 6.0

FinDocMRE is a new multi-image document-level benchmark spanning 12 financial domains and 5 task types, showing that 11 tested LMMs all score below 65 overall with particular weaknesses in numerical estimation and cro...
A More Word-like Image Tokenization for MLLMs
cs.CV 2026-05 unverdicted novelty 6.0

DiVT clusters patch embeddings into coherent semantic units and adapts token count to image complexity, matching or exceeding baselines with fewer visual tokens on multimodal benchmarks.
LatentUMM: Dual Latent Alignment for Unified Multimodal Models
cs.CV 2026-05 unverdicted novelty 6.0

LatentUMM proposes dual latent alignment at modality and capacity levels plus latent dynamics stabilization to reduce semantic drift and improve consistency in unified multimodal models.
UAM: A Dual-Stream Perspective on Forgetting in VLA Training
cs.CV 2026-05 unverdicted novelty 6.0

UAM adds a Dorsal Expert initialized from a generative model and trained on visual dynamics prediction to preserve over 95% of VLM multimodal ability in VLA training while achieving top success rates on manipulation t...
SEED: Targeted Data Selection by Weighted Independent Set
cs.LG 2026-05 unverdicted novelty 6.0

SEED models data selection as Weighted Independent Set on a similarity graph, using node value calibration and local scale normalization to produce compact high-quality training subsets that outperform prior methods o...
LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs
cs.CV 2026-05 unverdicted novelty 6.0

LRCP prunes visual tokens in LVLMs by scoring projection residuals onto a PCA-estimated low-rank subspace, achieving 88.9% image token reduction with 94.7% performance retention and 87.5% video reduction with 97.8% ac...
GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

GRIP-VLM applies group-relative policy optimization via reinforcement learning to prune visual tokens in VLMs, yielding up to 15% inference speedup at matched accuracy over prior methods.
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
cs.CV 2026-05 unverdicted novelty 6.0

Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement
cs.CV 2026-05 unverdicted novelty 6.0

A new attention-enhancement method using ARS scores and RVE reduces action-relation hallucinations in LVLMs while generalizing to spatial and object hallucinations.
When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs
cs.CV 2026-05 unverdicted novelty 6.0

Layer-wise Laplacian energy of visual attention reveals hallucination emergence in MLLMs and enables LaSCD, a closed-form logit remapping strategy that mitigates hallucinations while preserving general performance.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · cited by 167 Pith papers · 23 internal anchors

[1]

Infmllm.https://github.com/mightyzau/InfMLLM, 2023

work page 2023
[2]

Lion.https://github.com/mynameischaos/Lion, 2023

work page 2023
[3]

Octopus.https://github.com/gray311/UnifiedMultimodalInstructionTuning, 2023

work page 2023
[4]

Skywork-mm.https://github.com/will-singularity/Skywork-MM, 2023

work page 2023
[5]

Visualglm-6b. https://github.com/THUDM/VisualGLM-6B, 2023

work page 2023
[6]

Wemm. https://github.com/scenarios/WeMM, 2023

work page 2023
[7]

Xcomposer-vl. https://github.com/InternLM/InternLM-XComposer, 2023

work page 2023
[8]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. NeurIPS, 2022

work page 2022
[9]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NeurIPS, 2020

work page 2020
[11]

Microsoft COCO Captions: Data Collection and Evaluation Server

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint:1504.00325, 2015

work page internal anchor Pith review arXiv 2015
[12]

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint:2305.06500, 2023

work page internal anchor Pith review arXiv 2023
[13]

PaLM-E: An Embodied Multimodal Language Model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint:2303.03378, 2023

work page internal anchor Pith review arXiv 2023
[14]

Mmbench-video: A long-form multi-shot benchmark for holistic video understanding

Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video understanding. NeurIPS, 2024

work page 2024
[15]

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint:2304.15010, 2023

work page internal anchor Pith review arXiv 2023
[16]

Multimodal-gpt: A vision and language model for dialogue with humans

Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint:2305.04790, 2023

work page arXiv 2023
[17]

Making the v in vqa matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017

work page 2017
[18]

Imagebind-llm: Multi-modality instruction tuning

Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, et al. Imagebind-llm: Multi-modality instruction tuning. arXiv preprint:2309.03905, 2023

work page arXiv 2023
[19]

Bliva: A simple multimodal llm for better handling of text-rich visual questions

Wenbo Hu, Yifan Xu, Y Li, W Li, Z Chen, and Z Tu. Bliva: A simple multimodal llm for better handling of text-rich visual questions. arXiv preprint:2308.09936, 2023

work page arXiv 2023
[20]

Movienet: A holistic dataset for movie understanding

Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, and Dahua Lin. Movienet: A holistic dataset for movie understanding. In ECCV, 2020

work page 2020
[21]

Language Is Not All You Need: Aligning Perception with Language Models

Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint:2302.14045, 2023

work page internal anchor Pith review arXiv 2023
[22]

Mimic-it: Multi-modal in-context instruction tuning

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint:2306.05425, 2023

work page arXiv 2023
[23]

Otter: A Multi-Modal Model with In-Context Instruction Tuning

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint:2305.03726, 2023

work page internal anchor Pith review arXiv 2023
[24]

Empowering vision-language models to follow interleaved vision-language instructions

Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Hanwang Zhang, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, and Yueting Zhuang. Fine-tuning multimodal llms to follow zero-shot demonstrative instructions. arXiv preprint:2308.04152, 2023

work page arXiv 2023
[25]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint:2301.12597, 2023

work page internal anchor Pith review arXiv 2023
[26]

Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint:2305.10355, 2023

work page internal anchor Pith review arXiv 2023
[27]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014

work page 2014
[28]

SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint:2311.07575, 2023

work page internal anchor Pith review arXiv 2023
[29]

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint:2306.14565, 2023

work page internal anchor Pith review arXiv 2023
[30]

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint:2304.08485, 2023

work page internal anchor Pith review arXiv 2023
[31]

Geometry-guided dense perspective network for speech-driven facial animation

Jingying Liu, Binyuan Hui, Kun Li, Yunke Liu, Yu-Kun Lai, Yuxiang Zhang, Yebin Liu, and Jingyu Yang. Geometry-guided dense perspective network for speech-driven facial animation. IEEE TVCG, 2021

work page 2021
[32]

MMBench: Is Your Multi-modal Model an All-around Player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint:2307.06281, 2023

work page internal anchor Pith review arXiv 2023
[33]

Curved scene text detection via transverse and longitudinal sequence connection

Yuliang Liu, Lianwen Jin, Shuaitao Zhang, Canjie Luo, and Sheng Zhang. Curved scene text detection via transverse and longitudinal sequence connection. PR, 2019

work page 2019
[34]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. NeurIPS, 2022

work page 2022
[35]

Cheap and quick: Efficient vision-language instruction tuning for large language models

Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, and Rongrong Ji. Cheap and quick: Efficient vision-language instruction tuning for large language models. arXiv preprint:2305.15023, 2023

work page arXiv 2023
[36]

Deepart: Learning joint representations of visual arts

Hui Mao, Ming Cheung, and James She. Deepart: Learning joint representations of visual arts. In ICM, 2017

work page 2017
[37]

Visual arts search on mobile devices

Hui Mao, James She, and Ming Cheung. Visual arts search on mobile devices. TOMM, 2019

work page 2019
[38]

Ok-vqa: A visual question answering benchmark requiring external knowledge

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In CVPR, 2019

work page 2019
[39]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report. arXiv preprint:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint:2303.17580, 2023

work page internal anchor Pith review arXiv 2023
[41]

PandaGPT: One Model To Instruction-Follow Them All

Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all. arXiv preprint:2305.16355, 2023

work page internal anchor Pith review arXiv 2023
[42]

Kimi-VL Technical Report

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025

work page internal anchor Pith review arXiv 2025
[43]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

An overview of large ai models and their applications

Xiaoguang Tu, Zhi He, Yi Huang, Zhi-Hao Zhang, Ming Yang, and Jian Zhao. An overview of large ai models and their applications. Visual Intelligence, 2024

work page 2024
[45]

GIT: A Generative Image-to-text Transformer for Vision and Language

Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language. arXiv preprint:2205.14100, 2022

work page internal anchor Pith review arXiv 2022
[46]

Visionllm: Large language model is also an open-ended decoder for vision-centric tasks

Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint:2305.11175, 2023

work page arXiv 2023
[47]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint:2201.11903, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[48]

Dycrowd: Towards dynamic crowd reconstruction from a large-scene video

Hao Wen, Hongbo Kang, Jian Ma, Jing Huang, Yuanwang Yang, Haozhe Lin, Yu-Kun Lai, and Kun Li. Dycrowd: Towards dynamic crowd reconstruction from a large-scene video. IEEE TPAMI, 2025

work page 2025
[49]

Google landmarks dataset v2-a large-scale benchmark for instance-level recognition and retrieval

Tobias Weyand, Andre Araujo, Bingyi Cao, and Jack Sim. Google landmarks dataset v2-a large-scale benchmark for instance-level recognition and retrieval. In CVPR, 2020

work page 2020
[50]

An early evaluation of gpt-4v(ision)

Yang Wu, Shilong Wang, Hao Yang, Tian Zheng, Hongbo Zhang, Yanyan Zhao, and Bing Qin. An early evaluation of gpt-4v(ision). arXiv preprint:2310.16534, 2023

work page arXiv 2023
[51]

Multiinstruct: Improving multi-modal zero-shot learning via instruction tuning

Zhiyang Xu, Ying Shen, and Lifu Huang. Multiinstruct: Improving multi-modal zero-shot learning via instruction tuning. arXiv preprint:2212.10773, 2022

work page arXiv 2022
[52]

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. arXiv preprint:2311.04257, 2023

work page internal anchor Pith review arXiv 2023
[53]

A Survey on Multimodal Large Language Models

Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. arXiv preprint:2306.13549, 2023

work page internal anchor Pith review arXiv 2023
[54]

Woodpecker: Hallucination correction for multimodal large language models

Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, and Enhong Chen. Woodpecker: Hallucination correction for multimodal large language models. arXiv preprint:2310.16045, 2023

work page arXiv 2023
[55]

Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi

Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, et al. Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi. arXiv preprint arXiv:2404.16006, 2024

work page arXiv 2024
[56]

Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants

Tianyu Yu, Jinyi Hu, Yuan Yao, Haoye Zhang, Yue Zhao, Chongyi Wang, Shan Wang, Yinxv Pan, Jiao Xue, Dahai Li, et al. Reformulating vision-language foundation models and datasets towards universal multimodal assistants. arXiv preprint:2310.00653, 2023

work page arXiv 2023
[57]

What matters in training a gpt4-style language model with multimodal inputs?

Yan Zeng, Hanbo Zhang, Jiani Zheng, Jiangnan Xia, Guoqiang Wei, Yang Wei, Yuchen Zhang, and Tao Kong. What matters in training a gpt4-style language model with multimodal inputs? arXiv preprint:2307.02469, 2023

work page arXiv 2023
[58]

Transfer visual prompt generator across llms

Ao Zhang, Hao Fei, Yuan Yao, Wei Ji, Li Li, Zhiyuan Liu, and Tat-Seng Chua. Transfer visual prompt generator across llms. arXiv preprint:2305.01278, 2023

work page arXiv 2023
[59]

Logavatar: Local gaussian splatting for human avatar modeling from monocular video

Jinsong Zhang, Xiongzheng Li, Hailong Jia, Jin Li, Zhuo Su, Guidong Wang, and Kun Li. Logavatar: Local gaussian splatting for human avatar modeling from monocular video. CAD, 2025

work page 2025
[60]

Speechact: Towards generating whole-body motion from speech

Jinsong Zhang, Minjie Zhu, Yuxiang Zhang, Zerong Zheng, Yebin Liu, and Kun Li. Speechact: Towards generating whole-body motion from speech. IEEE TVCG, 2025

work page 2025
[61]

Mmicl: Empowering vision-language model with multi-modal in-context learning

Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, and Baobao Chang. Mmicl: Empowering vision-language model with multi-modal in-context learning. arXiv preprint:2309.07915, 2023

work page arXiv 2023
[62]

A Survey of Large Language Models

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint:2303.18223, 2023

work page Pith review arXiv 2023
[63]

On evaluating adversarial robustness of large vision-language models

Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models. arXiv preprint:2305.16934, 2023

work page arXiv 2023
[64]

Chatbridge: Bridging modalities with large language model as a language catalyst

Zijia Zhao, Longteng Guo, Tongtian Yue, Sihan Chen, Shuai Shao, Xinxin Zhu, Zehuan Yuan, and Jing Liu. Chatbridge: Bridging modalities with large language model as a language catalyst. arXiv preprint:2305.16103, 2023

work page arXiv 2023
[65]

Learning deep features for scene recognition using places database

Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. NeurIPS, 2014

work page 2014
[66]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint:2304.10592, 2023

work page internal anchor Pith review arXiv 2023

[1] [1]

Infmllm.https://github.com/mightyzau/InfMLLM, 2023

work page 2023

[2] [2]

Lion.https://github.com/mynameischaos/Lion, 2023

work page 2023

[3] [3]

Octopus.https://github.com/gray311/UnifiedMultimodalInstructionTuning, 2023

work page 2023

[4] [4]

Skywork-mm.https://github.com/will-singularity/Skywork-MM, 2023

work page 2023

[5] [5]

Visualglm-6b.https://github.com/THUDM/VisualGLM-6B, 2023

work page 2023

[6] [6]

Wemm.https://github.com/scenarios/WeMM, 2023

work page 2023

[7] [7]

Xcomposer-vl.https://github.com/InternLM/InternLM-XComposer, 2023

work page 2023

[8] [8]

Flamingo: a visual language model for few-shot learning.NeurIPS, 2022

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.NeurIPS, 2022. 9

work page 2022

[9] [9]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities.arXiv preprint:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[10] [10]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NeurIPS, 2020

work page 2020

[11] [11]

Microsoft COCO Captions: Data Collection and Evaluation Server

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server.arXiv preprint:1504.00325, 2015

work page internal anchor Pith review arXiv 2015

[12] [12]

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning.arXiv preprint:2305.06500, 2023

work page internal anchor Pith review arXiv 2023

[13] [13]

PaLM-E: An Embodied Multimodal Language Model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model.arXiv preprint:2303.03378, 2023

work page internal anchor Pith review arXiv 2023

[14] [14]

Mmbench- video: A long-form multi-shot benchmark for holistic video understanding.NeurIPS, 2024

Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench- video: A long-form multi-shot benchmark for holistic video understanding.NeurIPS, 2024

work page 2024

[15] [15]

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Con- ghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model.arXiv preprint:2304.15010, 2023

work page internal anchor Pith review arXiv 2023

[16] [16]

Multimodal-gpt: A vision and lan- guage model for dialogue with humans

Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint:2305.04790, 2023

work page arXiv 2023

[17] [17]

Making the v in vqa matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InCVPR, 2017

work page 2017

[18] [18]

Imagebind-llm: Multi-modality instruction tuning

Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, et al. Imagebind-llm: Multi-modality instruction tuning.arXiv preprint:2309.03905, 2023

work page arXiv 2023

[19] [19]

Bliva: A simple multimodal llm for better handling of text-rich visual questions

Wenbo Hu, Yifan Xu, Y Li, W Li, Z Chen, and Z Tu. Bliva: A simple multimodal llm for better handling of text-rich visual questions.arXiv preprint:2308.09936, 2023

work page arXiv 2023

[20] [20]

Movienet: A holistic dataset for movie understanding

Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, and Dahua Lin. Movienet: A holistic dataset for movie understanding. InECCV, 2020

work page 2020

[21] [21]

Language Is Not All You Need: Aligning Perception with Language Models

Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models.arXiv preprint:2302.14045, 2023

work page internal anchor Pith review arXiv 2023

[22] [22]

Mimic-it: Multi-modal in-context instruction tuning,

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruction tuning.arXiv preprint:2306.05425, 2023

work page arXiv 2023

[23] [23]

Otter: A Multi-Modal Model with In-Context Instruction Tuning

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning.arXiv preprint:2305.03726, 2023

work page internal anchor Pith review arXiv 2023

[24] [24]

Empowering vision- language models to follow interleaved vision-language in- structions

Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Hanwang Zhang, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, and Yueting Zhuang. Fine-tuning multimodal llms to follow zero-shot demonstrative instructions.arXiv preprint:2308.04152, 2023

work page arXiv 2023

[25] [25]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.arXiv preprint:2301.12597, 2023

work page internal anchor Pith review arXiv 2023

[26] [26]

Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models.arXiv preprint:2305.10355, 2023

work page internal anchor Pith review arXiv 2023

[27] [27]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InECCV, 2014

work page 2014

[28] [28]

SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models.arXiv preprint:2311.07575, 2023. 10

work page internal anchor Pith review arXiv 2023

[29] [29]

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning.arXiv preprint:2306.14565, 2023

work page internal anchor Pith review arXiv 2023

[30] [30]

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.arXiv preprint:2304.08485, 2023

work page internal anchor Pith review arXiv 2023

[31] [31]

Geometry-guided dense perspective network for speech-driven facial animation.IEEE TVCG, 2021

Jingying Liu, Binyuan Hui, Kun Li, Yunke Liu, Yu-Kun Lai, Yuxiang Zhang, Yebin Liu, and Jingyu Yang. Geometry-guided dense perspective network for speech-driven facial animation.IEEE TVCG, 2021

work page 2021

[32] [32]

MMBench: Is Your Multi-modal Model an All-around Player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player?arXiv preprint:2307.06281, 2023

work page internal anchor Pith review arXiv 2023

[33] [33]

Curved scene text detection via transverse and longitudinal sequence connection.PR, 2019

Yuliang Liu, Lianwen Jin, Shuaitao Zhang, Canjie Luo, and Sheng Zhang. Curved scene text detection via transverse and longitudinal sequence connection.PR, 2019

work page 2019

[34] [34]

Learn to explain: Multimodal reasoning via thought chains for science question answering.NeurIPS, 2022

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering.NeurIPS, 2022

work page 2022

[35] [35]

Cheap and quick: Efficient vision-language instruction tuning for large language models,

Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, and Rongrong Ji. Cheap and quick: Efficient vision-language instruction tuning for large language models.arXiv preprint:2305.15023, 2023

work page arXiv 2023

[36] [36]

Deepart: Learning joint representations of visual arts

Hui Mao, Ming Cheung, and James She. Deepart: Learning joint representations of visual arts. InICM, 2017

work page 2017

[37] [37]

Visual arts search on mobile devices.TOMM, 2019

Hui Mao, James She, and Ming Cheung. Visual arts search on mobile devices.TOMM, 2019

work page 2019

[38] [38]

Ok-vqa: A visual question answering benchmark requiring external knowledge

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. InCVPR, 2019

work page 2019

[39] [39]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report.arXiv preprint:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[40] [40]

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface.arXiv preprint:2303.17580, 2023

work page internal anchor Pith review arXiv 2023

[41] [41]

PandaGPT: One Model To Instruction-Follow Them All

Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction- follow them all.arXiv preprint:2305.16355, 2023

work page internal anchor Pith review arXiv 2023

[42] [42]

Kimi-VL Technical Report

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491, 2025

work page internal anchor Pith review arXiv 2025

[43] [43]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[44] [44]

An overview of large ai models and their applications.Visual Intelligence, 2024

Xiaoguang Tu, Zhi He, Yi Huang, Zhi-Hao Zhang, Ming Yang, and Jian Zhao. An overview of large ai models and their applications.Visual Intelligence, 2024

work page 2024

[45] [45]

GIT: A Generative Image-to-text Transformer for Vision and Language

Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language.arXiv preprint:2205.14100, 2022

work page internal anchor Pith review arXiv 2022

[46] [46]

Visionllm: Large language model is also an open-ended decoder for vision-centric tasks

Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks.arXiv preprint:2305.11175, 2023

work page arXiv 2023

[47] [47]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models.arXiv preprint:2201.11903, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[48] [48]

Dycrowd: Towards dynamic crowd reconstruction from a large-scene video.IEEE TPAMI, 2025

Hao Wen, Hongbo Kang, Jian Ma, Jing Huang, Yuanwang Yang, Haozhe Lin, Yu-Kun Lai, and Kun Li. Dycrowd: Towards dynamic crowd reconstruction from a large-scene video.IEEE TPAMI, 2025

work page 2025

[49] [49]

Google landmarks dataset v2-a large-scale benchmark for instance-level recognition and retrieval

Tobias Weyand, Andre Araujo, Bingyi Cao, and Jack Sim. Google landmarks dataset v2-a large-scale benchmark for instance-level recognition and retrieval. InCVPR, 2020

work page 2020

[50] [50]

arXiv preprint arXiv:2310.16534

Yang Wu, Shilong Wang, Hao Yang, Tian Zheng, Hongbo Zhang, Yanyan Zhao, and Bing Qin. An early evaluation of gpt-4v (ision).arXiv preprint:2310.16534, 2023. 11

work page arXiv 2023

[51] [51]

Qinyuan Ye, Bill Yuchen Lin, and Xiang Ren

Zhiyang Xu, Ying Shen, and Lifu Huang. Multiinstruct: Improving multi-modal zero-shot learning via instruction tuning.arXiv preprint:2212.10773, 2022

work page arXiv 2022

[52] Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. arXiv preprint:2311.04257, 2023

[53] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. arXiv preprint:2306.13549, 2023

[54] Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, and Enhong Chen. Woodpecker: Hallucination correction for multimodal large language models. arXiv preprint:2310.16045, 2023

[55] Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, et al. Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi. arXiv preprint arXiv:2404.16006, 2024

[56] Tianyu Yu, Jinyi Hu, Yuan Yao, Haoye Zhang, Yue Zhao, Chongyi Wang, Shan Wang, Yinxv Pan, Jiao Xue, Dahai Li, et al. Reformulating vision-language foundation models and datasets towards universal multimodal assistants. arXiv preprint:2310.00653, 2023

[57] Yan Zeng, Hanbo Zhang, Jiani Zheng, Jiangnan Xia, Guoqiang Wei, Yang Wei, Yuchen Zhang, and Tao Kong. What matters in training a gpt4-style language model with multimodal inputs? arXiv preprint:2307.02469, 2023

[58] Ao Zhang, Hao Fei, Yuan Yao, Wei Ji, Li Li, Zhiyuan Liu, and Tat-Seng Chua. Transfer visual prompt generator across llms. arXiv preprint:2305.01278, 2023

[59] Jinsong Zhang, Xiongzheng Li, Hailong Jia, Jin Li, Zhuo Su, Guidong Wang, and Kun Li. Logavatar: Local gaussian splatting for human avatar modeling from monocular video. CAD, 2025

[60] Jinsong Zhang, Minjie Zhu, Yuxiang Zhang, Zerong Zheng, Yebin Liu, and Kun Li. Speechact: Towards generating whole-body motion from speech. IEEE TVCG, 2025

[61] Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, and Baobao Chang. Mmicl: Empowering vision-language model with multi-modal in-context learning. arXiv preprint:2309.07915, 2023

[62] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint:2303.18223, 2023

[63] Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models. arXiv preprint:2305.16934, 2023

[64] Zijia Zhao, Longteng Guo, Tongtian Yue, Sihan Chen, Shuai Shao, Xinxin Zhu, Zehuan Yuan, and Jing Liu. Chatbridge: Bridging modalities with large language model as a language catalyst. arXiv preprint:2305.16103, 2023

[65] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. NeurIPS, 2014

[66] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint:2304.10592, 2023