WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

Aleksandra Korolova; Boya Zeng; Chung Peng Lee; Gabriel Sarch; Harish Krishnakumar; Hu Xu; Shengbang Tong; Wenhao Chai; Wenhu Chen; Xingyu Fu

arxiv: 2606.06538 · v1 · pith:3QP63GBAnew · submitted 2026-06-04 · 💻 cs.CV

Yida Yin, Harish Krishnakumar, Chung Peng Lee, Boya Zeng, Wenhao Chai, Shengbang Tong, Wenhu Chen, Hu Xu, Xingyu Fu, Gabriel Sarch, Aleksandra Korolova, Zhuang Liu

Pith reviewed 2026-06-28 02:59 UTC · model grok-4.3

keywords WorldBench · multimodal benchmark · visual diversity · MLLM evaluation · visual understanding · reasoning benchmark · taxonomy curation · image diversity

The pith

WorldBench uses a visual concept taxonomy to curate diverse images and shows even top MLLMs reach only 64 percent accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates WorldBench to address a gap in existing multimodal benchmarks, which add task types but rarely capture the range of real-world visual inputs. It builds a taxonomy spanning thousands of visual concepts, collects images from search engines and existing datasets to match that taxonomy, and manually writes hard questions that frontier models fail. Quantitative and human evaluations indicate that the benchmark is more visually diverse than prior ones. Of the 15 MLLMs tested, the strongest scores 64.0 percent, while some perform only marginally above chance. The work argues that visual diversity matters for reliable performance outside narrow test sets.

Core claim

WorldBench is built by defining a taxonomy of visual concepts across domains, curating images from search engines and existing datasets to cover it, and manually writing questions that frontier models fail. The authors report that the result is more visually diverse than existing benchmarks on both quantitative and human evaluations. Evaluation of 15 MLLMs finds the best model at 64.0 percent accuracy, with some models only marginally above chance, exposing limits in current visual understanding.
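Whether a score counts as "marginally above chance" depends on the number of questions and answer options, and this page reports neither. The sketch below is a minimal illustration under stated assumptions: a hypothetical set of 200 four-option multiple-choice questions, and an illustrative helper `binom_tail` that computes the probability of scoring at least that well by guessing. None of these values or names come from the paper.

```python
from math import exp, lgamma, log

def binom_tail(n, k, p):
    """Return P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.

    Each term is computed in log space so large n does not overflow.
    """
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log(1 - p))
        total += exp(log_pmf)
    return total

# Hypothetical benchmark shape; the page does not report these values.
N_QUESTIONS = 200
N_OPTIONS = 4
chance = 1 / N_OPTIONS

for accuracy in (0.28, 0.64):
    n_correct = round(accuracy * N_QUESTIONS)
    p_value = binom_tail(N_QUESTIONS, n_correct, chance)
    print(f"accuracy={accuracy:.2f} correct={n_correct} "
          f"P(at least this many by guessing)={p_value:.3g}")
```

Under these assumptions, a score a few points above 25 percent would be hard to separate from guessing, while 64 percent would not; the real conclusion depends on the benchmark's actual size and answer format.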

What carries the argument

A taxonomy of thousands of visual concepts that guides curation of images from search engines and existing datasets, combined with trial-and-error manual question design.
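As a rough illustration of taxonomy-guided curation, the sketch below checks how much of a taxonomy a set of collected images covers. The toy taxonomy, the concept names, and the `coverage` helper are all hypothetical; the paper's actual taxonomy and tooling are not described on this page.

```python
from collections import Counter

# Hypothetical toy taxonomy: domain -> category -> leaf concepts.
TAXONOMY = {
    "living things": {
        "birds": ["heron", "kiwi"],
        "fungi": ["morel"],
    },
    "artifacts": {
        "tools": ["adze", "caliper"],
    },
}

def leaf_concepts(tree):
    """Yield every leaf concept from a nested dict whose leaves are lists."""
    if isinstance(tree, dict):
        for child in tree.values():
            yield from leaf_concepts(child)
    else:
        yield from tree

def coverage(taxonomy, image_concepts):
    """Return (fraction of leaves with at least one image, uncovered leaves)."""
    leaves = set(leaf_concepts(taxonomy))
    counts = Counter(c for c in image_concepts if c in leaves)
    missing = sorted(leaves - set(counts))
    return len(counts) / len(leaves), missing

# Hypothetical concept labels attached to collected images.
images = ["heron", "heron", "adze", "morel"]
fraction, missing = coverage(TAXONOMY, images)
print(fraction, missing)
```

A curator could use the list of uncovered leaves to decide which concepts still need images, which is one plausible way a taxonomy can steer collection.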

If this is right

MLLMs require stronger mechanisms for handling varied visual inputs to reach reliable real-world performance.
Benchmark construction should prioritize explicit coverage of visual concepts over simply adding more task types.
Models that score near chance on diverse images are unlikely to generalize safely to open-ended visual settings.
Evaluation protocols for multimodal systems should include diversity metrics alongside accuracy.
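The diversity metric used in the paper is not defined on this page. As one hedged illustration of what such a metric can look like, the sketch below computes an order-2 "effective number" of distinct items from cosine similarities between embedding vectors. This is a simplified stand-in in the spirit of similarity-based diversity scores, chosen because it needs no eigendecomposition: for a similarity matrix K with unit diagonal it equals n^2 divided by the sum of squared entries of K. The function names and toy vectors are assumptions, not the paper's method.

```python
import math

def cosine_kernel(vectors):
    """Return the pairwise cosine-similarity matrix (vectors must be nonzero)."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [sum(a * b for a, b in zip(u, v)) / (nu * nv)
         for v, nv in zip(vectors, norms)]
        for u, nu in zip(vectors, norms)
    ]

def order2_diversity(vectors):
    """Effective number of distinct items: n^2 / sum of squared similarities.

    Equals 1 when all vectors point the same way and n when they are
    mutually orthogonal.
    """
    kernel = cosine_kernel(vectors)
    n = len(vectors)
    return n * n / sum(x * x for row in kernel for x in row)

# Hypothetical embeddings: three near-duplicates versus three unrelated items.
redundant = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
spread = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(order2_diversity(redundant), order2_diversity(spread))
```

Reporting a score like this next to accuracy would let readers compare benchmarks on visual breadth, provided the same embedding model is used for every benchmark being compared.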

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training data that lacks similar visual breadth may be a root cause of the observed performance gaps.
The benchmark could serve as a filter for selecting models intended for deployment in uncontrolled visual environments.
Extending the taxonomy approach to video or 3D inputs would test whether the same diversity issues appear in other modalities.
Human performance baselines on the same questions would clarify how far current models remain from human-level visual reasoning.

Load-bearing premise

The manually chosen images and questions, shaped by the taxonomy, test general visual understanding rather than just the specific concepts or sources selected.

What would settle it

Quantitative diversity scores showing WorldBench is not higher than prior benchmarks, or a new model reaching above 80 percent accuracy while still performing strongly on older tests.

Original abstract

In real-world applications, models are expected to perform reliably across diverse settings. Yet, many existing multimodal benchmarks expand task types without capturing the visual diversity needed to handle open-ended visual inputs. We present WorldBench, a challenging and visually diverse reasoning benchmark to evaluate Multimodal Large Language Models (MLLMs). We build a taxonomy of thousands of visual concepts across multiple domains (e.g., living things). Guided by this taxonomy, we curate a broad collection of images from search engines and existing datasets to comprehensively represent the visual world. Through structured trial-and-error, we manually design challenging questions that frontier MLLMs fail to answer. On quantitative and human evaluations, WorldBench achieves higher visual diversity than any existing diverse benchmark. Evaluating 15 MLLMs on WorldBench reveals weaknesses in visual understanding: even the strongest model reaches only 64.0% accuracy, while some models perform marginally above chance-level. We hope our work highlights the importance of visual diversity in building multimodal benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

WorldBench brings a taxonomy-guided image set and claims higher visual diversity, but its trial-and-error question filter against current models makes the 64% ceiling hard to interpret as a general visual gap.

The main things to know are that the authors built a taxonomy of visual concepts, pulled images from search and datasets to cover it, then used structured trial-and-error to write questions that top MLLMs miss. They report higher diversity scores than prior benchmarks and show the best model at 64% accuracy.

What stands out as new is the explicit taxonomy spanning domains like living things and the attempt to measure diversity both quantitatively and with humans. That curation step is more deliberate than the usual task-type expansion in other multimodal benchmarks.

The low accuracy numbers are presented cleanly across 15 models. The paper does a service by flagging that many existing tests do not stress open visual inputs enough.

The soft spot is the question design process. Manually creating items that frontier models fail, then keeping them, risks selecting for the exact failure modes those models already have rather than sampling the visual-concept space evenly. The abstract gives no numbers on how many candidates were discarded or what the final difficulty distribution looks like, so the 64% figure could partly reflect the filter. The diversity metrics are also stated without the actual formulas or inter-rater details, which leaves the central claim under-supported.

This is a benchmark paper aimed at people who build or evaluate MLLMs and want tests that push visual coverage. Anyone working on evaluation design will find the taxonomy useful to look at, even if they end up modifying the question selection.

It should go to peer review. The construction choices need external scrutiny before the numbers can be treated as stable evidence of visual understanding gaps.

Referee Report

1 major / 2 minor

Summary. The paper introduces WorldBench, a multimodal reasoning benchmark for MLLMs. It builds a taxonomy of visual concepts, curates diverse images from search engines and datasets, and designs challenging questions via structured trial-and-error against frontier models. The work claims higher visual diversity than prior benchmarks on quantitative and human evaluations, and reports that 15 evaluated MLLMs achieve at most 64.0% accuracy (with some near chance), indicating weaknesses in visual understanding.

Significance. If the question selection process can be shown not to introduce bias toward current model failure modes, WorldBench could serve as a useful diagnostic for visual diversity gaps in MLLMs and inform more robust benchmark construction practices.

major comments (1)

[Abstract] Abstract: the claim that low accuracy (64.0% for the strongest model) reveals 'weaknesses in visual understanding' depends on the questions being a fair sample of the visual-concept space. The described procedure of 'structured trial-and-error' to retain only items that frontier MLLMs fail on creates a selection filter that may preferentially capture model-specific weaknesses rather than intrinsic visual-understanding deficits; this is load-bearing for the central interpretation of the results.
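The selection effect described in this comment can be illustrated with a small simulation. It is a toy model under loud assumptions, not a description of the paper's procedure: each item has a latent difficulty drawn uniformly from [0, 1), a "filter" model and a separate "held-out" model each answer correctly with probability 1 minus that difficulty, and only items the filter model fails are kept. All names and numbers are hypothetical.

```python
import random

def simulate(n_items=20000, seed=0):
    """Return (held-out accuracy on all items, held-out accuracy on kept items)."""
    rng = random.Random(seed)
    all_correct = 0
    kept_total = 0
    kept_correct = 0
    for _ in range(n_items):
        difficulty = rng.random()          # shared latent item difficulty
        p_correct = 1.0 - difficulty       # same success rate for both models
        filter_ok = rng.random() < p_correct     # model used during question design
        held_out_ok = rng.random() < p_correct   # model evaluated afterwards
        all_correct += held_out_ok
        if not filter_ok:                  # keep only items the filter model fails
            kept_total += 1
            kept_correct += held_out_ok
    return all_correct / n_items, kept_correct / kept_total

overall, kept = simulate()
print(f"held-out accuracy on all items:  {overall:.3f}")
print(f"held-out accuracy on kept items: {kept:.3f}")
```

In this toy model the held-out model's expected accuracy falls from 1/2 on all items to 1/3 on the kept items, even though it was never used in the filter, because filtering concentrates the set on items that are hard for both models. How large the effect is for WorldBench cannot be inferred from this sketch.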

minor comments (2)

[Abstract] Abstract: the quantitative diversity metric used to claim superiority over existing benchmarks is referenced but not defined or reported with values, preventing verification of the diversity claim.
[Abstract] Abstract: the exact question design process, taxonomy details, and any controls for confounds (e.g., concept selection bias) are not specified, limiting assessment of whether the benchmark comprehensively tests visual understanding.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting this important point regarding the interpretation of our results. We agree that the question selection procedure requires careful framing and will revise the abstract and related discussion to address the concern.

Referee: [Abstract] Abstract: the claim that low accuracy (64.0% for the strongest model) reveals 'weaknesses in visual understanding' depends on the questions being a fair sample of the visual-concept space. The described procedure of 'structured trial-and-error' to retain only items that frontier MLLMs fail on creates a selection filter that may preferentially capture model-specific weaknesses rather than intrinsic visual-understanding deficits; this is load-bearing for the central interpretation of the results.

Authors: We acknowledge the validity of this concern. The trial-and-error process is explicitly designed to produce questions that current frontier models cannot solve, which intentionally filters for items that expose limitations rather than representing a uniform or random sample of the visual-concept space. This means the benchmark is best interpreted as a diagnostic tool for current gaps rather than a comprehensive measure of intrinsic visual understanding deficits across all possible questions. We will revise the abstract to replace the phrasing 'reveals weaknesses in visual understanding' with language that more precisely states the results demonstrate limitations of existing MLLMs on visually diverse reasoning tasks. We will also expand the methods and discussion sections to describe the selection process more explicitly and note its implications for interpretation.

revision: yes

Circularity Check

0 steps flagged

No circularity; benchmark paper contains no derivation chain

The paper introduces WorldBench via taxonomy-guided curation and trial-and-error question design, followed by direct evaluation of 15 MLLMs. No equations, parameters, predictions, or first-principles derivations are present. Diversity claims rest on separate quantitative and human evaluations. The low-accuracy observation is a straightforward measurement on the constructed set rather than a reduction of any claimed result to its own inputs by construction. No self-citations or ansatzes are load-bearing in a mathematical sense. This is a standard benchmark paper whose central claims do not exhibit the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review is abstract-only so the ledger is minimal; the central claim rests on the domain assumption that visual diversity drives real-world reliability.

axioms (1)

Domain assumption: Visual diversity across concepts is required for reliable performance on open-ended visual inputs in real-world applications.
Stated as the motivation in the first sentence of the abstract.

