MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
Pith reviewed 2026-05-16 07:55 UTC · model grok-4.3
The pith
Even the strongest multimodal LLMs fail to reach 60 percent accuracy on high-resolution real-world tasks
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MME-RealWorld is the largest manually annotated multimodal benchmark to date, built from 13,366 high-resolution images and 29,429 question-answer pairs spanning 43 subtasks in five real-world scenarios; evaluation of 28 prominent MLLMs shows that none reaches 60 percent accuracy, leaving high-resolution perception and complex real-world understanding as open challenges.
What carries the argument
The MME-RealWorld dataset, produced by filtering over 300,000 images to 13,366 high-resolution examples and creating 29,429 question-answer pairs through direct annotation by 25 annotators and 7 experts.
If this is right
- High-resolution image perception must become a primary target for architectural and training improvements.
- Complex real-world scenario understanding will require dedicated advances beyond current scaling approaches.
- Future benchmarks should adopt larger manual annotation pipelines and higher native resolutions to reduce performance variance (the scaling effect is sketched numerically after this list).
- The gap between current model scores and practical deployment thresholds indicates that many real-world applications remain out of reach.
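The variance point above follows directly from sampling statistics: the standard error of an accuracy estimate shrinks with the square root of the number of questions. A minimal sketch, using a purely illustrative model score, comparing a small benchmark to MME-RealWorld's 29,429 questions:

```python
import math

def accuracy_standard_error(p: float, n: int) -> float:
    """Standard error of an observed accuracy p measured on n independent questions
    (binomial approximation)."""
    return math.sqrt(p * (1.0 - p) / n)

# Illustrative only: a model scoring near 55% accuracy.
p = 0.55
for n in (1_000, 5_000, 29_429):  # 29,429 = size of MME-RealWorld's QA set
    se = accuracy_standard_error(p, n)
    # A rough 95% interval is p +/- 1.96 * se.
    print(f"n={n:>6}: accuracy {p:.2f} +/- {1.96 * se:.3f} (95% interval)")
```

Under this approximation (which ignores correlation among questions drawn from the same image or subtask), moving from 1,000 to roughly 29,000 questions shrinks the 95 percent interval from about ±3.1 to about ±0.6 percentage points.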
Where Pith is reading between the lines
- The open release of the full dataset and evaluation code allows direct comparison of new models against a fixed, high-difficulty standard.
- Subtask-level analysis could isolate whether failures stem more from resolution limits or from semantic complexity; a minimal breakdown sketch follows this list.
- Similar manual filtering and annotation methods could be applied to video or multi-image sequences to extend the evaluation protocol.
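The subtask-level analysis suggested above can be made concrete. A hedged sketch, assuming a per-question results table whose column names, buckets, and rows are hypothetical rather than the benchmark's released schema: cross-tabulating correctness by subtask and native resolution separates perceptual failures from semantic ones.

```python
import pandas as pd

# Hypothetical per-question results; the released evaluation output format may differ.
results = pd.DataFrame({
    "subtask":    ["OCR/street sign", "OCR/street sign", "diagram reasoning", "diagram reasoning"],
    "resolution": [3840 * 2160, 1280 * 720, 3840 * 2160, 1280 * 720],  # pixels per image
    "correct":    [0, 1, 0, 0],
})

# Bucket questions by the native resolution of their image.
results["res_bucket"] = pd.cut(
    results["resolution"],
    bins=[0, 1e6, 4e6, float("inf")],
    labels=["<1MP", "1-4MP", ">4MP"],
)

# Accuracy and sample size per (subtask, resolution bucket).
breakdown = (
    results.groupby(["subtask", "res_bucket"], observed=True)["correct"]
    .agg(accuracy="mean", n="count")
)
print(breakdown)
# If accuracy falls sharply with resolution within a subtask, the bottleneck is
# perceptual; if low-resolution accuracy is also poor, semantic complexity is implicated.
```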
Load-bearing premise
The 13,366 selected images and 29,429 QA pairs truly represent high-resolution real-world scenarios that remain extremely difficult even for humans.
What would settle it
Human expert performance scores on the identical 29,429 questions; if experts average below roughly 70 percent, the claim that the benchmark isolates model-specific difficulty would require revision.
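A minimal sketch of that decision rule, with toy outcomes only and the roughly-70-percent figure treated as a working threshold rather than a number reported in the paper:

```python
from statistics import mean

def settles_it(expert_correct: list[int], model_correct: list[int],
               threshold: float = 0.70) -> str:
    """Compare expert and model accuracy on the same question set.

    expert_correct / model_correct: per-question 0/1 outcomes, aligned by index.
    threshold: working cut-off for 'experts clearly solve the benchmark'.
    """
    expert_acc, model_acc = mean(expert_correct), mean(model_correct)
    if expert_acc < threshold:
        return (f"Experts score {expert_acc:.1%}: low model scores may reflect "
                "label noise or ambiguity rather than model-specific difficulty.")
    return (f"Experts score {expert_acc:.1%} vs. models {model_acc:.1%}: the gap "
            "supports a genuine, model-specific perception/reasoning deficit.")

# Toy data only; real expert annotations are exactly what the referee asks for.
print(settles_it([1, 1, 0, 1, 1, 1, 0, 1], [1, 0, 0, 1, 0, 0, 1, 0]))
```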
Original abstract
Comprehensive evaluation of Multimodal Large Language Models (MLLMs) has recently garnered widespread attention in the research community. However, we observe that existing benchmarks present several common barriers that make it difficult to measure the significant challenges that models face in the real world, including: 1) small data scale leads to a large performance variance; 2) reliance on model-based annotations results in restricted data quality; 3) insufficient task difficulty, especially caused by the limited image resolution. To tackle these issues, we introduce MME-RealWorld. Specifically, we collect more than 300K images from public datasets and the Internet, filtering 13,366 high-quality images for annotation. This involves the efforts of 25 professional annotators and 7 experts in MLLMs, contributing 29,429 question-answer pairs that cover 43 subtasks across 5 real-world scenarios, extremely challenging even for humans. As far as we know, MME-RealWorld is the largest manually annotated benchmark to date, featuring the highest resolution and a targeted focus on real-world applications. We further conduct a thorough evaluation involving 28 prominent MLLMs, such as GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet. Our results show that even the most advanced models struggle with our benchmark: none of them reaches 60% accuracy. The challenges of perceiving high-resolution images and understanding complex real-world scenarios remain urgent issues to be addressed. The data and evaluation code are released at https://mme-realworld.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MME-RealWorld, a large-scale manually annotated benchmark for MLLMs consisting of 13,366 high-resolution images and 29,429 QA pairs spanning 43 subtasks across 5 real-world scenarios. Images were collected from public datasets and the Internet, filtered for quality, and annotated by 25 annotators plus 7 experts. The authors evaluate 28 prominent MLLMs (including GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet) and report that none reaches 60% accuracy, concluding that high-resolution perception and complex real-world reasoning remain open challenges.
Significance. If the annotations prove reliable and the tasks are verifiably difficult for humans, the benchmark would provide a valuable, large-scale resource for measuring progress on high-resolution multimodal understanding. Its manual curation and focus on real-world scenarios could help identify specific failure modes not captured by smaller or synthetic benchmarks, supporting targeted improvements in MLLM architectures for perception and reasoning.
Major comments (2)
- [Abstract and §3, Benchmark Construction] The repeated claim that the scenarios are 'extremely challenging even for humans' lacks any supporting human performance baseline, inter-annotator agreement statistics, or expert verification pass rates on the final 29,429 QA pairs. Without these metrics it is impossible to distinguish whether model accuracies below 60% reflect genuine perceptual/reasoning difficulty or potential label noise and ambiguity.
- [§4, Evaluation] The headline result that 'none of them reach 60% accuracy' is presented as evidence that high-resolution real-world scenarios remain unsolved, yet the paper provides no separate human evaluation on the released test set to ground this interpretation.
Minor comments (2)
- [Table 1 or equivalent] Add a breakdown of the 43 subtasks with example questions and image resolutions to improve clarity on task coverage.
- [§2, Related Work] Ensure all recent high-resolution MLLM benchmarks are cited for proper context on novelty.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for stronger evidence on task difficulty. We address both major comments below and will revise the manuscript to incorporate human baselines and related metrics.
Point-by-point responses
- Referee: [Abstract and §3, Benchmark Construction] The repeated claim that the scenarios are 'extremely challenging even for humans' lacks any supporting human performance baseline, inter-annotator agreement statistics, or expert verification pass rates on the final 29,429 QA pairs. Without these metrics it is impossible to distinguish whether model accuracies below 60% reflect genuine perceptual/reasoning difficulty or potential label noise and ambiguity.
  Authors: We agree that human performance baselines, inter-annotator agreement (IAA), and explicit verification rates would strengthen the claim of human-level difficulty. The annotation involved 25 professional annotators and 7 MLLM experts with multi-round verification, but the current manuscript does not report formal IAA or human accuracy on the final set. In the revision we will add a dedicated subsection to §3 describing the quality control pipeline, including IAA statistics computed on overlapping annotations (a toy agreement computation is sketched after these responses) and expert verification pass rates. We will also report human performance on a sampled subset of the released test set (approximately 1,000 QA pairs) evaluated by independent experts under identical conditions to the models. Revision: yes.
- Referee: [§4, Evaluation] The headline result that 'none of them reach 60% accuracy' is presented as evidence that high-resolution real-world scenarios remain unsolved, yet the paper provides no separate human evaluation on the released test set to ground this interpretation.
  Authors: We acknowledge that a direct human evaluation on the test set is necessary to ground the interpretation of model performance below 60%. As outlined in the response to the first comment, the revised manuscript will include human accuracy results on a representative sample of the test set. This will allow readers to compare model and human performance directly and confirm that the reported accuracies reflect genuine perceptual and reasoning challenges rather than annotation artifacts. Revision: yes.
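The first response commits to inter-annotator agreement statistics on overlapping annotations. As a toy illustration of one standard choice, Fleiss' kappa over multiply-annotated questions can be computed as below; the five-option answer labels and three-annotator overlap are assumptions made for the example, not the paper's protocol.

```python
from collections import Counter

def fleiss_kappa(ratings: list[list[str]]) -> float:
    """Fleiss' kappa for multiply-annotated items.

    ratings: one inner list per question, holding each annotator's chosen label.
    Every question must have the same number of annotators.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])

    # Per-item agreement P_i and overall category counts.
    p_i_sum = 0.0
    category_totals = Counter()
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        p_i_sum += (sum(c * c for c in counts.values()) - n_raters) / (n_raters * (n_raters - 1))

    p_bar = p_i_sum / n_items                      # mean observed agreement
    p_e = sum((total / (n_items * n_raters)) ** 2  # chance agreement
              for total in category_totals.values())
    return (p_bar - p_e) / (1.0 - p_e)

# Toy overlap set: 4 questions, 3 annotators each, options A-E assumed.
toy = [["A", "A", "A"], ["B", "B", "C"], ["E", "E", "E"], ["D", "A", "D"]]
print(f"Fleiss' kappa on toy overlap set: {fleiss_kappa(toy):.2f}")
```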
Circularity Check
No circularity: empirical benchmark creation with no derivation chain
Full rationale
The paper is an empirical effort to collect images, manually annotate 29,429 QA pairs via 25 annotators and 7 experts, and evaluate 28 existing MLLMs on the resulting test set. No equations, fitted parameters, predictions, or mathematical derivations appear anywhere in the manuscript. All claims reduce to direct measurement on the curated data rather than any self-referential construction or self-citation load-bearing step. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: manual annotation by 25 annotators and 7 MLLM experts produces high-quality QA pairs that reflect genuine task difficulty for humans.
Forward citations
Cited by 21 Pith papers
- S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images
  S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
- HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing
  HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
- Improving Vision-language Models with Perception-centric Process Reward Models
  Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.
- SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs
  SLQ turns frozen MLLMs into retrievers via shared latent queries appended to inputs, outperforming fine-tuning on COCO and Flickr30K while introducing KARR-Bench for knowledge-aware evaluation.
- V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
  V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the...
- Dual-Pathway Circuits of Object Hallucination in Vision-Language Models
  Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.
- Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
  SCOLAR addresses information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens from LLM hidden states, extending acceptable CoT length over 30x and achieving +14.12% gains on b...
- Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
  SCOLAR fixes information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens via a detransformer, extending acceptable CoT length over 30x and delivering +14.12% gains on reasoni...
- LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
  LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and ...
- SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring
  SIEVES improves selective prediction coverage by up to 3x on OOD VQA benchmarks by training a selector to score the quality of visual evidence produced by reasoner models, generalizing across benchmarks and proprietar...
- SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring
  SIEVES improves selective prediction coverage up to 3x on OOD VQA benchmarks by training a selector on visual localization quality, generalizing across datasets and proprietary reasoners without specific adaptation.
- SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs
  SLQ adapts frozen MLLMs for multimodal retrieval by appending shared latent queries to text and image tokens and introduces KARR-Bench to test knowledge-aware reasoning retrieval.
- Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization
  MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
- Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
  Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.
- DeepEyesV2: Toward Agentic Multimodal Model
  DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
  InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
- Perceptual Flow Network for Visually Grounded Reasoning
  PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
- Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images
  TTSP resolves the Grounding Paradox by treating perception as a scalable test-time process that generates, filters, and iteratively refines multiple visual exploration traces, outperforming baselines on high-resolutio...
- Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
  HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.
- Qwen2.5-Omni Technical Report
  Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text perfo...