arxiv: 2307.06281 · v5 · submitted 2023-07-12 · 💻 cs.CV · cs.CL

MMBench: Is Your Multi-modal Model an All-around Player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, Dahua Lin

Pith reviewed 2026-05-12 17:15 UTC · model grok-4.3

classification 💻 cs.CV cs.CL

keywords MMBench · vision-language models · multimodal benchmark · CircularEval · bilingual evaluation · LLM-assisted scoring · objective evaluation

The pith

MMBench provides a bilingual benchmark with CircularEval and LLM-assisted scoring for robust holistic evaluation of vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MMBench to address shortcomings in existing VLM evaluations. Traditional benchmarks deliver quantitative scores but offer little insight into specific abilities, while human-based subjective tests are hard to scale and introduce bias. MMBench assembles a large, carefully controlled set of questions covering many perceptual and reasoning skills, presented in both English and Chinese. It applies a CircularEval procedure and routes free-form model outputs through a large language model to map them onto fixed choices. This produces objective, comparable scores that better reveal whether models qualify as all-around multimodal performers.

Core claim

MMBench is a systematically designed objective benchmark for a robust and holistic evaluation of vision-language models, built around meticulously curated questions, a CircularEval strategy, LLM conversion of free-form predictions into pre-defined choices, and parallel English-Chinese multiple-choice versions that together surpass prior benchmarks in scale, variety, and reliability.

What carries the argument

The CircularEval strategy paired with LLM-based conversion of free-form predictions into choices, which standardizes outputs from instruction-weak models while preserving evaluation accuracy across bilingual contexts.
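
CircularEval is only named on this page. In the paper it means asking the same question once per rotation of the choice list and counting the question as solved only if every pass is answered correctly. A minimal sketch of that idea follows, assuming a hypothetical `ask_model` callable that returns an option letter; the function name, signature, and lettering scheme are illustrative and are not taken from MMBench or VLMEvalKit.

```python
import string
from typing import Callable, Sequence


def circular_eval(
    question: str,
    choices: Sequence[str],
    answer_index: int,
    ask_model: Callable[[str, Sequence[str]], str],  # hypothetical model wrapper
) -> bool:
    """Return True only if the model is right under every rotation of the choices."""
    n = len(choices)
    labels = string.ascii_uppercase[:n]
    for shift in range(n):
        # Rotate the choices left by `shift` positions.
        rotated = [choices[(i + shift) % n] for i in range(n)]
        # The correct choice moves from answer_index to (answer_index - shift) mod n.
        correct_label = labels[(answer_index - shift) % n]
        if ask_model(question, rotated) != correct_label:
            return False  # a single failed pass fails the whole question
    return True


# Usage: a model that always answers "A" fails even when "A" is initially correct,
# because the correct choice moves to other positions in later passes.
def always_a(question: str, options: Sequence[str]) -> str:
    return "A"


def oracle(question: str, options: Sequence[str]) -> str:
    return string.ascii_uppercase[list(options).index("cat")]


print(circular_eval("What animal is shown?", ["cat", "dog", "bird"], 0, always_a))
print(circular_eval("What animal is shown?", ["cat", "dog", "bird"], 0, oracle))
```

The all-passes rule is what makes the score stricter than single-pass accuracy: position-biased guessing rarely survives every rotation.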

Load-bearing premise

The large language model used to map a vision-language model's free-form answers onto the correct multiple-choice options does so without introducing its own systematic errors or biases.
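
One plausible shape for the conversion step is to try cheap rule-based matching first and call the LLM only when the rules are ambiguous. The sketch below is written under that assumption: `ask_llm` is a hypothetical callable, and the prompt wording is invented for illustration and is not the template used by the authors.

```python
import re
from typing import Callable, List, Optional, Sequence


def option_labels(choices: Sequence[str]) -> List[str]:
    return [chr(ord("A") + i) for i in range(len(choices))]


def match_by_rule(output: str, choices: Sequence[str]) -> Optional[str]:
    """Return a label only when exactly one option is indicated by letter or text."""
    labels = option_labels(choices)
    # Standalone capital letters such as "B", "(B)" or "B." that are valid labels.
    letter_hits = {m for m in re.findall(r"\b[A-Z]\b", output) if m in labels}
    # Options whose text appears verbatim (case-insensitive) in the output.
    text_hits = {
        lab for lab, text in zip(labels, choices) if text.lower() in output.lower()
    }
    candidates = letter_hits | text_hits
    return next(iter(candidates)) if len(candidates) == 1 else None


def extract_choice(
    output: str,
    choices: Sequence[str],
    ask_llm: Callable[[str], str],  # hypothetical LLM wrapper
) -> Optional[str]:
    label = match_by_rule(output, choices)
    if label is not None:
        return label
    labels = option_labels(choices)
    listing = "\n".join(f"{lab}. {text}" for lab, text in zip(labels, choices))
    # Illustrative prompt only; not the paper's template.
    prompt = (
        "Which option does the answer below select? "
        "Reply with one letter, or Z if none fits.\n"
        f"Options:\n{listing}\nAnswer: {output}"
    )
    reply = ask_llm(prompt).strip().upper()
    return reply if reply in labels else None
```

An output such as "A cat is sitting there" shows why the fallback matters: the article "A" looks like an option letter while the text points at a different option, so the rules abstain and the decision is deferred to the LLM. That deferral is exactly where the premise above could fail.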

What would settle it

A controlled human review of several hundred LLM-converted answers that shows frequent mismatches with the original model intent would falsify the accuracy of the automated scoring pipeline.
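
The review described above reduces to comparing two label sequences: the option the LLM assigned to each free-form answer and the option a human reviewer assigned. A small sketch of how raw agreement and Cohen's kappa could be computed for such a sample follows; the function name and the toy labels are illustrative, not data from the paper.

```python
from collections import Counter
from typing import Sequence, Tuple


def agreement_and_kappa(
    llm_labels: Sequence[str], human_labels: Sequence[str]
) -> Tuple[float, float]:
    """Return (observed agreement, Cohen's kappa) for two equal-length label lists."""
    if not llm_labels or len(llm_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(llm_labels)
    observed = sum(a == b for a, b in zip(llm_labels, human_labels)) / n
    llm_freq = Counter(llm_labels)
    human_freq = Counter(human_labels)
    # Agreement expected by chance if both raters labelled independently.
    expected = sum(llm_freq[k] * human_freq[k] for k in llm_freq) / (n * n)
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)
    return observed, kappa


# Toy example with made-up labels.
observed, kappa = agreement_and_kappa(["A", "B", "C", "A"], ["A", "B", "C", "B"])
print(f"agreement={observed:.2f}, kappa={kappa:.2f}")
```

Kappa is reported alongside raw agreement because agreement alone can look high when one option dominates the sample.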

Original abstract

Large vision-language models (VLMs) have recently achieved remarkable progress, exhibiting impressive multimodal perception and reasoning abilities. However, effectively evaluating these large VLMs remains a major challenge, hindering future development in this domain. Traditional benchmarks like VQAv2 or COCO Caption provide quantitative performance measurements but lack fine-grained ability assessment and robust evaluation metrics. Meanwhile, subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, which is not scalable and may display significant bias. In response to these challenges, we propose MMBench, a bilingual benchmark for assessing the multi-modal capabilities of VLMs. MMBench methodically develops a comprehensive evaluation pipeline, primarily comprised of the following key features: 1. MMBench is meticulously curated with well-designed quality control schemes, surpassing existing similar benchmarks in terms of the number and variety of evaluation questions and abilities; 2. MMBench introduces a rigorous CircularEval strategy and incorporates large language models to convert free-form predictions into pre-defined choices, which helps to yield accurate evaluation results for models with limited instruction-following capabilities. 3. MMBench incorporates multiple-choice questions in both English and Chinese versions, enabling an apples-to-apples comparison of VLMs' performance under a bilingual context. To summarize, MMBench is a systematically designed objective benchmark for a robust and holistic evaluation of vision-language models. We hope MMBench will assist the research community in better evaluating their models and facilitate future progress in this area. The evalutation code of MMBench has been integrated into VLMEvalKit: https://github.com/open-compass/VLMEvalKit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes MMBench, a bilingual (English/Chinese) benchmark for holistic evaluation of vision-language models. It features a large, curated set of multiple-choice questions with quality controls, a CircularEval strategy, and an LLM-based step (e.g., GPT-4) that maps free-form VLM outputs to predefined choices. The paper claims this yields more accurate, objective, and scalable results than traditional benchmarks such as VQAv2 or subjective ones such as OwlEval. The evaluation code is released via VLMEvalKit.

Significance. If the pipeline is shown to be reliable, MMBench would advance VLM evaluation by providing breadth across abilities, bilingual comparability, and an automated conversion step that reduces reliance on human labor. The public integration of the evaluation code into VLMEvalKit is a clear strength for reproducibility and community use.

major comments (1)

[Abstract / Evaluation Pipeline] Abstract, key feature 2: The claim that the LLM-based conversion under CircularEval 'helps to yield accurate evaluation results for models with limited instruction-following capabilities' is load-bearing for the 'objective' and 'robust' benchmark assertion. No human validation, inter-annotator agreement rates, error analysis on mapping accuracy, or bias audit of this step is described. If the LLM systematically misparses ambiguous or low-following outputs, the resulting accuracy metrics could deviate from true model capability.

minor comments (2)

[Abstract] Abstract: 'evalutation' is a typo and should be 'evaluation'.
[Method] The manuscript would benefit from an explicit statement of the total number of questions, the distribution across ability categories, and the exact prompt template used for the LLM conversion step to allow direct replication.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address the major comment below and commit to strengthening the manuscript with additional validation details.

Referee: [Abstract / Evaluation Pipeline] Abstract, key feature 2: The claim that the LLM-based conversion under CircularEval 'helps to yield accurate evaluation results for models with limited instruction-following capabilities' is load-bearing for the 'objective' and 'robust' benchmark assertion. No human validation, inter-annotator agreement rates, error analysis on mapping accuracy, or bias audit of this step is described. If the LLM systematically misparses ambiguous or low-following outputs, the resulting accuracy metrics could deviate from true model capability.

Authors: We appreciate the referee highlighting this important point. The LLM-based conversion step, which maps free-form VLM outputs to predefined choices and is used alongside CircularEval, is intended to mitigate format mismatches that commonly affect models with weaker instruction-following abilities, thereby supporting more objective scoring than direct string matching. We acknowledge that the submitted manuscript does not include a dedicated human validation study, inter-annotator agreement metrics, or a systematic bias audit for the conversion step. In the revised version we will add a new subsection (and corresponding appendix) reporting results from human verification on a sampled subset of predictions. This will quantify mapping accuracy, report agreement rates, and analyze failure cases where the LLM might misparse ambiguous outputs. We expect the added analysis to confirm that the conversion step improves reliability over naive approaches, particularly for lower-instruction-following models, while also discussing any residual limitations.

revision: yes
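
A mapping accuracy measured on a sampled subset is only an estimate, so it would normally be reported with an interval. The sketch below uses the Wilson score interval for a binomial proportion; the sample counts in the usage lines are placeholders rather than results from the paper, and `z = 1.96` assumes a 95% confidence level.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for the proportion correct / total."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2.0 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total)) / denom
    )
    return centre - half_width, centre + half_width


# Placeholder counts: 285 of 300 sampled conversions judged correct by a human.
low, high = wilson_interval(285, 300)
print(f"mapping accuracy 0.95, interval [{low:.3f}, {high:.3f}]")
```

The Wilson form is chosen here because it stays inside [0, 1] even when nearly every sampled conversion is correct, which is the regime such a validation study would hope to be in.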

Circularity Check

0 steps flagged

No circularity: benchmark construction relies on external curation and an external LLM, with no self-referential reduction.

The paper describes MMBench as a curated benchmark with quality control, bilingual questions, and an evaluation pipeline that combines CircularEval with an external LLM (e.g., GPT-4) that maps free-form VLM outputs to choices. No mathematical derivation, parameter fitting, prediction step, or uniqueness theorem is claimed or present. The design draws from external data sources and an off-the-shelf LLM without any step that reduces by construction to the paper's own inputs or prior self-citations. The central claim of being an 'objective benchmark' is supported by the described pipeline rather than any self-definitional loop, yielding a self-contained contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that curated questions plus LLM conversion produce unbiased scores; no free parameters, new physical entities, or ad-hoc axioms are introduced beyond standard benchmark-construction practices.

pith-pipeline@v0.9.0 · 5638 in / 950 out tokens · 32932 ms · 2026-05-12T17:15:01.156376+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: MMBench methodically develops a comprehensive evaluation pipeline... CircularEval strategy and incorporates large language models to convert free-form predictions into pre-defined choices

IndisputableMonolith.Foundation.DimensionForcing · dimension_forced
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: surpassing existing similar benchmarks in terms of the number and variety of evaluation questions and abilities

IndisputableMonolith.Foundation.PhiForcing · phi_equation
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: MMBench is a systematically designed objective benchmark for a robust and holistic evaluation of vision-language models

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 37 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
cs.CL 2024-09 accept novelty 8.0
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
cs.CL 2023-11 unverdicted novelty 8.0
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
cs.CV 2026-05 unverdicted novelty 7.0
GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.

CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs
cs.CV 2026-05 conditional novelty 7.0
Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.

Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models
cs.CV 2026-04 unverdicted novelty 7.0
XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...

ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction
cs.CV 2026-04 unverdicted novelty 7.0
ShredBench shows state-of-the-art MLLMs perform well on intact documents but suffer sharp drops in restoration accuracy as fragmentation increases to 8-16 pieces, indicating insufficient cross-modal semantic reasoning...

Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
cs.CV 2026-03 unverdicted novelty 7.0
SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.

MLVU: Benchmarking Multi-task Long Video Understanding
cs.CV 2024-06 conditional novelty 7.0
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
cs.CL 2023-07 unverdicted novelty 7.0
SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
cs.CV 2023-03 conditional novelty 7.0
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0
GRIP-VLM applies group-relative policy optimization via reinforcement learning to prune visual tokens in VLMs, yielding up to 15% inference speedup at matched accuracy over prior methods.

Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination
cs.MM 2026-05 unverdicted novelty 6.0
LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.

Mirror, Mirror on the Wall: Can VLM Agents Tell Who They Are at All?
cs.AI 2026-05 unverdicted novelty 6.0
Stronger VLM agents use mirror reflections for self-identification in controlled 3D tests, while weaker ones inspect but fail to extract or correctly attribute self-relevant information.

VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading
cs.LG 2026-05 unverdicted novelty 6.0
VisMMoE exploits visual-expert affinity via token pruning to achieve up to 2.68x faster VL-MoE inference on memory-constrained hardware while keeping accuracy competitive.

Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models
cs.CV 2026-04 unverdicted novelty 6.0
Evaluator VLMs frequently fail to detect quality-degrading perturbations in I2T and T2I outputs, with failure rates exceeding 50% in some cases.

MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
cs.LG 2026-04 unverdicted novelty 6.0
MACS improves inference speed in multimodal MoE models by entropy-weighted balancing of visual tokens and real-time modality-adaptive expert capacity allocation.

MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
cs.LG 2026-04 unverdicted novelty 6.0
MACS improves MoE MLLM inference efficiency via entropy-weighted token loads and dynamic modality-adaptive expert capacity allocation.

Modality-Native Routing in Agent-to-Agent Networks: A Multimodal A2A Protocol Extension
cs.AI 2026-04 unverdicted novelty 6.0
Modality-native routing in A2A networks raises task accuracy from 32% to 52% over text-bottleneck baselines on a 50-task benchmark, but only when paired with capable downstream reasoning.

See Fair, Speak Truth: Equitable Attention Improves Grounding and Reduces Hallucination in Vision-Language Alignment
cs.CV 2026-04 conditional novelty 6.0
Equitable attention via Dominant Object Penalty and Outlier Boost Coefficient reduces object hallucinations in multimodal LLMs without retraining.

CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models
cs.CV 2026-04 unverdicted novelty 6.0
CLEAR uses degradation-aware fine-tuning, a latent representation bridge, and interleaved reinforcement learning to connect generative and reasoning capabilities in multimodal models for better degraded image understanding.

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV 2025-08 unverdicted novelty 6.0
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
cs.CV 2025-07 unverdicted novelty 6.0
GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
cs.CV 2025-04 conditional novelty 6.0
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
cs.CV 2024-12 unverdicted novelty 6.0
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

Emu3: Next-Token Prediction is All You Need
cs.CV 2024-09 unverdicted novelty 6.0
Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

Are We on the Right Way for Evaluating Large Vision-Language Models?
cs.CV 2024-03 conditional novelty 6.0
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
cs.CV 2023-11 conditional novelty 6.0
A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
cs.CV 2023-11 unverdicted novelty 6.0
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
cs.CV 2023-06 unverdicted novelty 6.0
MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.

Otter: A Multi-Modal Model with In-Context Instruction Tuning
cs.CV 2023-05 unverdicted novelty 6.0
Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.

Multilingual Training and Evaluation Resources for Vision-Language Models
cs.CL 2026-04 conditional novelty 5.0
Releases regenerated multilingual training data and translated benchmarks for VLMs in five languages and demonstrates consistent benefits from multilingual training over English-only baselines.

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
cs.CV 2025-05 conditional novelty 5.0
BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.

MiniCPM-V: A GPT-4V Level MLLM on Your Phone
cs.CV 2024-08 conditional novelty 5.0
MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.

Semantic Reality: Interactive Context-Aware Visualization of Inter-Object Relationships in Augmented Reality
cs.HC 2026-04 unverdicted novelty 4.0
Semantic Reality maintains a persistent connectivity graph of objects in AR via multimodal reasoning and action recognition, then visualizes relationships to aid understanding and task guidance.

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
cs.CV 2024-04 unverdicted novelty 4.0
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

Improved Baselines with Visual Instruction Tuning
cs.CV 2023-10 conditional novelty 4.0
Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
cs.AI 2025-01 conditional novelty 3.0
Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · cited by 36 Pith papers · 15 internal anchors

[1]

W3c school

W3c school. In https://www.w3schools.com/, 2023. 17

work page 2023
[2]

01-ai. Yi-vl. https://huggingface.co/01-ai/Yi-VL-34B, 2023. 4, 8, 10, 22, 23, 24, 25, 26

work page 2023
[3]

Nocaps: Novel object captioning at scale

Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. Nocaps: Novel object captioning at scale. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8948–8957, 2019. 3

work page 2019
[4]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022. 4, 7, 8, 10, 22, 23, 24, 25, 26

work page 2022
[5]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng X...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023. 4, 8, 10, 20, 22, 23, 24, 25, 26

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020. 4

work page 2020
[8]

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023. 4, 22, 23, 24, 25, 26

work page internal anchor Pith review arXiv 2023
[9]

Microsoft COCO Captions: Data Collection and Evaluation Server

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015. 2, 3, 17

work page internal anchor Pith review arXiv 2015
[10]

Xtuner: A toolkit for efficiently fine-tuning llm

XTuner Contributors. Xtuner: A toolkit for efficiently fine-tuning llm. https://github.com/InternLM/xtuner, 2023. 4, 10, 22, 23, 24, 25, 26

work page 2023
[11]

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023. 4, 8, 10, 20, 22, 23, 24, 25, 26

work page internal anchor Pith review arXiv 2023
[12]

Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model

Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer2: Mastering free-form text-image composition and comprehe...

work page arXiv 2024
[13]

Glm: General language model pretraining with autoregressive blank infilling

Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335, 2022. 10, 20, 22, 23, 24, 25, 26

work page 2022
[14]

Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, Dahua Lin, and Kai Chen. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models, 2024. 1, 8

work page 2024
[15]

Mitigating representation bias in action recognition: Algorithms and benchmarks

Haodong Duan, Yue Zhao, Kai Chen, Yuanjun Xiong, and Dahua Lin. Mitigating representation bias in action recognition: Algorithms and benchmarks, 2022. 17

work page 2022
[16]

The modularity of mind

Jerry A Fodor. The modularity of mind. MIT press, 1983. 4

work page 1983
[17]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. ArXiv, abs/2306.13394, 2023. 3, 18

work page internal anchor Pith review Pith/arXiv arXiv 2023
[18]

Multimodal-gpt: A vision and language model for dialogue with humans

Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans, 2023. 2

work page 2023
[19]

Making the v in vqa matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017. 2, 3

work page 2017
[20]

Vizwiz grand challenge: Answering visual questions from blind people

Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018. 3

work page 2018
[21]

V. Hosu, H. Lin, T. Sziranyi, and D. Saupe. Koniq-10k: An ecologically valid database for deep learning of blind image quality assessment. IEEE Transactions on Image Processing, 29:4041–4056, 2020. 17

work page 2020
[22]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. 4

work page internal anchor Pith review Pith/arXiv arXiv 2021
[23]

Gqa: A new dataset for real-world visual reasoning and compositional question answering

Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019. 2, 3, 19, 21

work page 2019
[24]

Clevr: A diagnostic dataset for compositional language and elementary visual reasoning

Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017. 17

work page 2017
[25]

ShapeWorld - A new test methodology for multimodal language understanding

Alexander Kuhnle and Ann Copestake. Shapeworld-a new test methodology for multimodal language understanding. arXiv preprint arXiv:1704.04517, 2017. 17

work page Pith review arXiv 2017
[26]

Obelics: An open web-scale filtered dataset of interleaved image-text documents

Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh. Obelics: An open web-scale filtered dataset of interleaved image-text documents, 2023. 8, 10, 20, 22, 23, 24, 25, 26

work page 2023
[27]

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023. 18

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023. 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29]

Dual-glance model for deciphering social relationships

Junnan Li, Yongkang Wong, Qi Zhao, and Mohan S Kankanhalli. Dual-glance model for deciphering social relationships. In Proceedings of the IEEE international conference on computer vision, pages 2650–2659, 2017. 17

work page 2017
[30]

Monkey: Image resolution and text label are important things for large multi-modal models

Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models. arXiv preprint arXiv:2311.06607, 2023. 22, 23, 24, 25, 26

work page arXiv 2023
[31]

Visual spatial reasoning

Fangyu Liu, Guy Edward Toh Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 2023. 17

work page 2023
[32]

Improved Baselines with Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023. 4, 8, 10, 22, 23, 24, 25, 26

work page internal anchor Pith review arXiv 2023
[33]

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023. 2, 4, 9, 17

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022. 3, 17

work page 2022
[35]

Ok-vqa: A visual question answering benchmark requiring external knowledge

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3195–3204, 2019. 2, 3, 6, 19, 21

work page 2019
[36]

Bayesian rationality: The probabilistic approach to human reasoning

Mike Oaksford and Nick Chater. Bayesian rationality: The probabilistic approach to human reasoning. Oxford University Press, 2007. 4

work page 2007
[37]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023. 1, 2, 4, 5, 8, 10, 17, 20, 23, 24, 25, 26

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

Omnilmm: Large multi-modal models for strong performance and efficient deployment

OpenBMB. Omnilmm: Large multi-modal models for strong performance and efficient deployment. https://github.com/OpenBMB/OmniLMM, 2023. 22, 23, 24, 25, 26

work page 2023
[39]

Minicpm: Unveiling the potential of end-side large language models

OpenBMB. Minicpm: Unveiling the potential of end-side large language models, 2024. 8, 10, 20, 22, 23, 24, 25, 26

work page 2024
[40]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022. 4

work page 2022
[41]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. 4

work page 2019
[42]

Towards vqa models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019. 3, 17, 19, 21

work page 2019
[43]

Pandagpt: One model to instruction-follow them all

Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355, 2023. 22, 23, 24, 25, 26

work page arXiv 2023
[44]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 2, 4, 5, 8, 10, 17, 20, 23, 24, 25, 26

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

Internlm: A multilingual language model with progressively enhanced capabilities

InternLM Team. Internlm: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM-techreport, 2023. 8, 9, 18

work page 2023
[46]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[47]

Cogvlm: Visual expert for pretrained language models

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. Cogvlm: Visual expert for pretrained language models. ArXiv, abs/2311.03079, 2023. 8, 10, 22, 23, 24, 25, 26

work page arXiv 2023
[48]

Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models

Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Jiao Qiao, and Ping Luo. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. 2023. 2

work page 2023
[49]

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023. 2, 3, 4, 22

work page Pith review arXiv 2023
[50]

mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration

Qinghao Ye, Haiyang Xu, Jiabo Ye, Mingshi Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. ArXiv, abs/2311.04257, 2023. 8, 10, 23, 24, 25, 26

work page arXiv 2023
[51]

From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014. 3

work page 2014
[52]

Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition

Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Shuangrui Ding, Songyang Zhang, Haodong Duan, Hang Yan, et al. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023. 22, 23, 24, 25, 26

work page arXiv 2023
[53]

Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. 2, 4, 9

work page 2023
[54]

Places: A 10 million image database for scene recognition

Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017. 17

work page 2017
[55]

Towards automatic learning of procedures from web instructional videos

Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. 3

work page 2018
[56]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023. 2, 4, 8, 10, 20, 22, 23, 24, 25, 26

work page internal anchor Pith review Pith/arXiv arXiv 2023