
arxiv: 2606.06538 · v1 · pith:3QP63GBAnew · submitted 2026-06-04 · 💻 cs.CV

WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

Pith reviewed 2026-06-28 02:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords WorldBench · multimodal benchmark · visual diversity · MLLM evaluation · visual understanding · reasoning benchmark · taxonomy curation · image diversity

The pith

WorldBench uses a visual concept taxonomy to curate diverse images and shows that even the strongest of 15 MLLMs reaches only 64.0 percent accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates WorldBench to close a gap in existing multimodal benchmarks, which add task types but rarely capture the range of real-world visual inputs. It builds a taxonomy spanning thousands of visual concepts, pulls images from search engines and existing datasets to match that taxonomy, and manually writes hard questions that frontier models fail to answer. Quantitative and human evaluations indicate that the benchmark is more visually diverse than prior ones. Of the 15 MLLMs tested, the strongest scores 64.0 percent, while some score only marginally above chance. The work argues that visual diversity matters for reliable performance outside narrow test sets.

Core claim

WorldBench is built by defining a taxonomy of visual concepts across domains, curating images to match it, and manually writing questions that frontier models fail. The paper reports that this produces a benchmark with measurably higher visual diversity than existing ones. Evaluation of 15 MLLMs finds the best model at 64.0 percent accuracy, with some others only marginally above chance, exposing limits in current visual understanding.
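
Whether a score is "marginally above chance" depends on the number of questions and answer options, neither of which is given in the abstract. The sketch below shows one common way to check it: a Wilson score interval around the observed accuracy, compared against the chance rate. The question count (1000) and option count (4) are hypothetical values chosen for illustration, and the function names are not from the paper.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def above_chance(correct: int, total: int, num_options: int) -> bool:
    """True if the lower interval bound exceeds the uniform-guessing rate."""
    chance = 1 / num_options
    low, _ = wilson_interval(correct, total)
    return low > chance


# Hypothetical setting: 1000 four-option questions (not stated in the source).
print(above_chance(640, 1000, 4))  # 64.0% accuracy: True
print(above_chance(270, 1000, 4))  # 27.0% accuracy: False
```

With these assumed sizes, 27 percent accuracy cannot be distinguished from guessing, which is why reporting the question count and answer format alongside accuracy matters.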

What carries the argument

A taxonomy of thousands of visual concepts that guides curation of images from search engines and existing datasets, combined with trial-and-error manual question design.

If this is right

  • MLLMs require stronger mechanisms for handling varied visual inputs to reach reliable real-world performance.
  • Benchmark construction should prioritize explicit coverage of visual concepts over simply adding more task types.
  • Models that score near chance on diverse images are unlikely to generalize safely to open-ended visual settings.
  • Evaluation protocols for multimodal systems should include diversity metrics alongside accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training data that lacks similar visual breadth may be a root cause of the observed performance gaps.
  • The benchmark could serve as a filter for selecting models intended for deployment in uncontrolled visual environments.
  • Extending the taxonomy approach to video or 3D inputs would test whether the same diversity issues appear in other modalities.
  • Human performance baselines on the same questions would clarify how far current models remain from human-level visual reasoning.

Load-bearing premise

The manually chosen images and questions, shaped by the taxonomy, test general visual understanding rather than just the specific concepts or sources selected.

What would settle it

Quantitative diversity scores showing WorldBench is not higher than prior benchmarks, or a new model reaching above 80 percent accuracy while still performing strongly on older tests.

Original abstract

In real-world applications, models are expected to perform reliably across diverse settings. Yet, many existing multimodal benchmarks expand task types without capturing the visual diversity needed to handle open-ended visual inputs. We present WorldBench, a challenging and visually diverse reasoning benchmark to evaluate Multimodal Large Language Models (MLLMs). We build a taxonomy of thousands of visual concepts across multiple domains (e.g., living things). Guided by this taxonomy, we curate a broad collection of images from search engines and existing datasets to comprehensively represent the visual world. Through structured trial-and-error, we manually design challenging questions that frontier MLLMs fail to answer. On quantitative and human evaluations, WorldBench achieves higher visual diversity than any existing diverse benchmark. Evaluating 15 MLLMs on WorldBench reveals weaknesses in visual understanding: even the strongest model reaches only 64.0% accuracy, while some models perform marginally above chance-level. We hope our work highlights the importance of visual diversity in building multimodal benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces WorldBench, a multimodal reasoning benchmark for MLLMs. It builds a taxonomy of visual concepts, curates diverse images from search engines and datasets, and designs challenging questions via structured trial-and-error against frontier models. The work claims higher visual diversity than prior benchmarks on quantitative and human evaluations, and reports that 15 evaluated MLLMs achieve at most 64.0% accuracy (with some near chance), indicating weaknesses in visual understanding.

Significance. If the question selection process can be shown not to introduce bias toward current model failure modes, WorldBench could serve as a useful diagnostic for visual diversity gaps in MLLMs and inform more robust benchmark construction practices.

major comments (1)
  1. [Abstract] The claim that low accuracy (64.0% for the strongest model) reveals 'weaknesses in visual understanding' depends on the questions being a fair sample of the visual-concept space. The described procedure of 'structured trial-and-error' to retain only items that frontier MLLMs fail on creates a selection filter that may preferentially capture model-specific weaknesses rather than intrinsic visual-understanding deficits; this is load-bearing for the central interpretation of the results.
minor comments (2)
  1. [Abstract] The quantitative diversity metric used to claim superiority over existing benchmarks is referenced but not defined or reported with values, preventing verification of the diversity claim.
  2. [Abstract] The exact question design process, taxonomy details, and any controls for confounds (e.g., concept selection bias) are not specified, limiting assessment of whether the benchmark comprehensively tests visual understanding.
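
To make concrete what a reported diversity value could look like, here is a minimal sketch of one published metric, the Vendi score (Friedman and Dieng, TMLR 2023): the exponential of the Shannon entropy of the eigenvalues of a normalized similarity matrix. The abstract does not say which metric the paper uses, so this is an illustration only. Cosine similarity over image embeddings is an assumption, and the embeddings are assumed to have no all-zero rows.

```python
import numpy as np


def vendi_score(embeddings: np.ndarray) -> float:
    """Vendi score with a cosine-similarity kernel (assumed kernel choice).

    Ranges from 1 (all items identical) to n (all items mutually orthogonal).
    """
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit rows, so K[i, i] = 1
    k = x @ x.T
    eigvals = np.linalg.eigvalsh(k / len(x))  # eigenvalues sum to 1
    eigvals = eigvals[eigvals > 1e-12]  # drop numerical zeros before the log
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


# Toy check: five identical vectors versus four orthogonal ones.
print(vendi_score(np.ones((5, 3))))
print(vendi_score(np.eye(4)))
```

The score can be read as an effective number of distinct items, which makes values comparable across benchmarks only when the same embedding model and sample size are used.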

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting this important point regarding the interpretation of our results. We agree that the question selection procedure requires careful framing and will revise the abstract and related discussion to address the concern.

  1. Referee: [Abstract] The claim that low accuracy (64.0% for the strongest model) reveals 'weaknesses in visual understanding' depends on the questions being a fair sample of the visual-concept space. The described procedure of 'structured trial-and-error' to retain only items that frontier MLLMs fail on creates a selection filter that may preferentially capture model-specific weaknesses rather than intrinsic visual-understanding deficits; this is load-bearing for the central interpretation of the results.

    Authors: We acknowledge the validity of this concern. The trial-and-error process is explicitly designed to produce questions that current frontier models cannot solve, which intentionally filters for items that expose limitations rather than representing a uniform or random sample of the visual-concept space. This means the benchmark is best interpreted as a diagnostic tool for current gaps rather than a comprehensive measure of intrinsic visual understanding deficits across all possible questions. We will revise the abstract to replace the phrasing 'reveals weaknesses in visual understanding' with language that more precisely states the results demonstrate limitations of existing MLLMs on visually diverse reasoning tasks. We will also expand the methods and discussion sections to describe the selection process more explicitly and note its implications for interpretation.

    revision: yes
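
The selection effect discussed in this exchange can be illustrated with a toy simulation. It is not the paper's procedure: the logistic response model, the ability values, and the function name are all assumptions made for illustration. Each item has a latent difficulty, a "filter" model decides which items are kept (only those it fails), and a second model is scored on both the full pool and the kept subset.

```python
import math
import random


def simulate(num_items: int = 20000, filter_ability: float = 2.0,
             test_ability: float = 1.0, seed: int = 0) -> tuple[float, float]:
    """Return (accuracy on all items, accuracy on items the filter model failed)."""
    rng = random.Random(seed)

    def answers_correctly(ability: float, difficulty: float) -> bool:
        # Assumed logistic response model: harder items are less likely solved.
        p = 1 / (1 + math.exp(-(ability - difficulty)))
        return rng.random() < p

    all_hits = kept_hits = kept = 0
    for _ in range(num_items):
        difficulty = rng.gauss(0, 1)
        filter_ok = answers_correctly(filter_ability, difficulty)
        test_ok = answers_correctly(test_ability, difficulty)
        all_hits += test_ok
        if not filter_ok:  # keep only items the filter model gets wrong
            kept += 1
            kept_hits += test_ok
    return all_hits / num_items, kept_hits / kept


unfiltered, filtered = simulate()
print(f"accuracy on full pool:   {unfiltered:.3f}")
print(f"accuracy on kept subset: {filtered:.3f}")
```

In this toy setting the kept subset is skewed toward hard items, so accuracy on it understates accuracy over the full pool. That is consistent with reading scores on a failure-filtered benchmark as diagnostic rather than as an estimate over the whole concept space.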

Circularity Check

0 steps flagged

No circularity; benchmark paper contains no derivation chain


The paper introduces WorldBench via taxonomy-guided curation and trial-and-error question design, followed by direct evaluation of 15 MLLMs. No equations, parameters, predictions, or first-principles derivations are present. Diversity claims rest on separate quantitative and human evaluations. The low-accuracy observation is a straightforward measurement on the constructed set rather than a reduction of any claimed result to its own inputs by construction. No self-citations or ansatzes are load-bearing in a mathematical sense. This is a standard benchmark paper whose central claims do not exhibit the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review is abstract-only so the ledger is minimal; the central claim rests on the domain assumption that visual diversity drives real-world reliability.

axioms (1)
  • domain assumption: Visual diversity across concepts is required for reliable performance on open-ended visual inputs in real-world applications.
    Stated as the motivation in the first sentence of the abstract.

pith-pipeline@v0.9.1-grok · 5737 in / 1127 out tokens · 50012 ms · 2026-06-28T02:59:50.514788+00:00 · methodology

