Benchmarking and Evaluating VLMs for Software Architecture Diagram Understanding
Pith reviewed 2026-05-13 17:11 UTC · model grok-4.3
The pith
Vision-language models top out at roughly 70 percent accuracy on software architecture diagram tasks in a new benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper presents SADU as a benchmark containing 154 carefully curated software architecture diagrams of behavioral, structural, and ER types, each paired with structured annotations and 2,431 question-answer tasks focused on counting and retrieval reasoning. Evaluation across 11 state-of-the-art VLMs from the Gemini, Claude, GPT, and Qwen families demonstrates that even the top performer, gemini-3-flash-preview, attains only 70.18 percent accuracy while gpt-4o-mini reaches just 17.77 percent, exposing clear limitations in diagram reasoning and visual relation grounding.
What carries the argument
The SADU benchmark of 154 diagrams and 2,431 counting and retrieval questions that probes VLMs on structured software engineering artifacts rather than generic images.
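The scoring protocol behind those accuracy numbers is easy to make concrete. A minimal sketch, assuming exact-match grading over per-question records (the field names and normalization rule are our illustration, not the paper's published schema):

```python
# Hypothetical SADU-style records; field names are illustrative, not the paper's schema.
tasks = [
    {"diagram": "class_diagram_01", "type": "counting",
     "question": "How many classes are shown?", "answer": "7"},
    {"diagram": "er_diagram_03", "type": "retrieval",
     "question": "Which entity does 'Order' reference?", "answer": "Customer"},
]

def normalize(s: str) -> str:
    # Case- and whitespace-insensitive comparison, a common exact-match convention.
    return " ".join(s.strip().lower().split())

def accuracy(predictions: dict, tasks: list) -> float:
    """Fraction of questions whose predicted answer exactly matches ground truth."""
    correct = sum(
        normalize(predictions.get(t["question"], "")) == normalize(t["answer"])
        for t in tasks
    )
    return correct / len(tasks)

preds = {"How many classes are shown?": "7",
         "Which entity does 'Order' reference?": "customer"}
print(accuracy(preds, tasks))  # 1.0
```

Under exact-match grading, a single benchmark-wide accuracy is all the headline numbers require; the per-type breakdowns discussed later stratify these same records.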
If this is right
- Software architecture diagram understanding remains challenging for all tested state-of-the-art VLMs.
- Models show particular weaknesses in visual relation grounding and structured diagram reasoning.
- SADU supplies a repeatable test for tracking improvements in diagram-aware AI systems.
- Current performance levels fall short of the reliability needed for faithful AI assistance in design-stage workflows.
- Progress on SADU tasks would directly support more consistent AI use across the software development lifecycle.
Where Pith is reading between the lines
- SADU-style questions could be added to VLM pre-training mixtures to reduce the domain gap between general images and engineering diagrams.
- Hybrid pipelines that combine diagram parsing with code analysis might compensate for the isolated weaknesses shown here.
- Extending the benchmark to include sequence or deployment diagrams would test whether the same relation-grounding shortfalls appear in other standard software notations.
- Low scores on ER diagrams suggest that data-modeling understanding may lag behind structural understanding in current VLMs.
Load-bearing premise
The 154 curated diagrams and 2,431 questions sufficiently represent the range of real-world software architecture diagrams and the reasoning skills needed by practicing engineers.
What would settle it
A VLM achieving over 90 percent accuracy on the full SADU set after training only on general image-text data would indicate that the observed limitations are not inherent to current architectures.
Original abstract
Software architecture diagrams are important design artifacts for communicating system structure, behavior, and data organization throughout the software development lifecycle. Although recent progress in large language models has substantially advanced code-centric software engineering tasks such as code generation, testing, and maintenance, the ability of modern vision-language models (VLMs) to understand software architecture diagrams remains underexplored. To address this gap, we present SADU, a benchmark for Software Architecture Diagram Understanding that evaluates VLMs on architecture diagrams as structured software engineering artifacts rather than generic images. SADU contains 154 carefully curated diagrams spanning behavioral, structural, and ER diagrams, paired with structured annotations and 2,431 question-answer tasks covering counting and retrieval reasoning. We evaluate 11 state-of-the-art VLMs from the Gemini, Claude, GPT, and Qwen families. Our results show that software architecture diagram understanding remains challenging for current models: the best-performing model gemini-3-flash-preview achieves only 70.18% accuracy, while gpt-4o-mini only achieves 17.77% accuracy. The results further reveal the weaknesses in diagram reasoning and visual relation grounding, highlighting a gap between current VLMs and the needs of design-stage software engineering. SADU provides a foundation for future research on diagram-aware AI systems and more faithful AI-assisted software engineering workflows.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SADU, a benchmark for Software Architecture Diagram Understanding consisting of 154 carefully curated diagrams spanning behavioral, structural, and ER types, paired with 2,431 question-answer tasks focused on counting and retrieval reasoning. It evaluates 11 state-of-the-art VLMs from the Gemini, Claude, GPT, and Qwen families, reporting that the best model (Gemini-3-flash-preview) reaches only 70.18% accuracy while GPT-4o-mini reaches 17.77%, and concludes that current VLMs exhibit weaknesses in diagram reasoning and visual relation grounding, highlighting a gap for design-stage software engineering.
Significance. If the benchmark holds, the work is significant for providing the first dedicated empirical evaluation of VLMs on software architecture diagrams as structured SE artifacts rather than generic images. The multi-model comparison across 11 systems with no post-hoc exclusions or parameter fitting supplies a reproducible baseline that can guide future diagram-aware AI development in software engineering.
Major comments (2)
- §3 (Dataset Construction): The claim that the 154 diagrams sufficiently represent real-world software architecture tasks rests on the description of them as 'carefully curated' spanning behavioral/structural/ER types, but the section provides no quantitative validation such as distributions of element counts, relation density, notation variants, or direct comparison to external corpora (e.g., open-source GitHub repositories). This directly affects the generalizability of the reported performance ceiling and the identified weaknesses in visual relation grounding.
- §5 (Results): The interpretation that low accuracies reveal 'weaknesses in diagram reasoning and visual relation grounding' is plausible from the aggregate numbers, but the section lacks a per-question-type or per-diagram-type error breakdown that would isolate whether failures stem from visual grounding, counting, or relation extraction; without this, the precise nature of the gap remains underspecified.
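The breakdown this comment asks for is a straightforward stratification of the same per-question results. A minimal sketch over hypothetical records (the fields and values are ours; the paper's result format is not given here):

```python
from collections import defaultdict

# Hypothetical per-question results; fields and values are illustrative only.
results = [
    {"q_type": "counting",  "d_type": "structural", "correct": True},
    {"q_type": "counting",  "d_type": "ER",         "correct": False},
    {"q_type": "retrieval", "d_type": "behavioral", "correct": True},
    {"q_type": "retrieval", "d_type": "ER",         "correct": False},
]

def breakdown(rows: list, key: str) -> dict:
    """Accuracy stratified by the given field (question type or diagram type)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in rows:
        totals[r[key]] += 1
        hits[r[key]] += r["correct"]
    return {k: hits[k] / totals[k] for k in totals}

print(breakdown(results, "q_type"))  # {'counting': 0.5, 'retrieval': 0.5}
print(breakdown(results, "d_type"))  # {'structural': 1.0, 'ER': 0.0, 'behavioral': 1.0}
```

The same stratification applied to real per-question logs would separate counting failures from retrieval failures without any additional annotation effort.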
Minor comments (2)
- Abstract and §4: The mention of 'structured annotations' should be expanded with a brief description of their format (e.g., whether they include explicit relation graphs or bounding boxes) to allow readers to assess how the 2,431 tasks were derived.
- Results tables: Adding 95% confidence intervals or standard deviations to the reported accuracy percentages would strengthen statistical interpretation of the model comparisons.
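Such intervals follow directly from the binomial nature of accuracy over a fixed question set. A sketch using the Wilson score interval, applied to the reported 70.18% over 2,431 questions (the interval itself is our illustration, not a figure from the paper):

```python
import math

def wilson_interval(correct: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# gemini-3-flash-preview: 70.18% of 2,431 questions is about 1,706 correct.
lo, hi = wilson_interval(1706, 2431)
print(f"[{lo:.3f}, {hi:.3f}]")  # roughly [0.683, 0.720]
```

With 2,431 questions the interval is under ±2 points, so the large gaps between model families reported here would remain clearly separated even after accounting for sampling noise.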
Simulated Author's Rebuttal
We thank the referee for the constructive comments and positive recommendation for minor revision. We address each major comment below and will incorporate the suggested improvements to strengthen the manuscript.
Point-by-point responses
-
Referee: §3 (Dataset Construction): The claim that the 154 diagrams sufficiently represent real-world software architecture tasks rests on the description of them as 'carefully curated' spanning behavioral/structural/ER types, but the section provides no quantitative validation such as distributions of element counts, relation density, notation variants, or direct comparison to external corpora (e.g., open-source GitHub repositories). This directly affects the generalizability of the reported performance ceiling and the identified weaknesses in visual relation grounding.
Authors: We agree that quantitative validation would improve the characterization of the benchmark's representativeness. In the revised manuscript, we will expand §3 with a new table and accompanying text providing summary statistics on element counts (nodes, edges, labels), relation densities, and notation variants across the 154 diagrams. Where feasible, we will also include a brief comparison to a sampled set of architecture diagrams from public GitHub repositories to support generalizability claims. revision: yes
-
Referee: §5 (Results): The interpretation that low accuracies reveal 'weaknesses in diagram reasoning and visual relation grounding' is plausible from the aggregate numbers, but the section lacks a per-question-type or per-diagram-type error breakdown that would isolate whether failures stem from visual grounding, counting, or relation extraction; without this, the precise nature of the gap remains underspecified.
Authors: We agree that a finer-grained error analysis would better isolate the sources of model failures. In the revision, we will augment §5 with per-question-type accuracy breakdowns (counting vs. retrieval) and per-diagram-type results (behavioral, structural, ER). We will also add a short qualitative subsection with representative error examples to clarify whether issues arise primarily from visual grounding, counting, or relation extraction. revision: yes
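The summary statistics the authors commit to for §3 (element counts, relation density) are mechanical once each diagram carries node and edge annotations. A sketch over hypothetical annotation records (the numbers and field names are illustrative, not the paper's data):

```python
import statistics

# Hypothetical diagram annotations; the paper's actual annotation format is not specified here.
diagrams = [
    {"type": "structural", "nodes": 12, "edges": 18},
    {"type": "behavioral", "nodes": 8,  "edges": 9},
    {"type": "ER",         "nodes": 6,  "edges": 7},
]

for d in diagrams:
    # Relation density: edges per node, one simple measure of diagram complexity.
    d["density"] = d["edges"] / d["nodes"]

print("mean nodes:", statistics.mean(d["nodes"] for d in diagrams))
print("mean density:", round(statistics.mean(d["density"] for d in diagrams), 3))
```

Reporting these distributions per diagram type, alongside a sample from external corpora, is what would let readers judge the representativeness concern raised in the first major comment.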
Circularity Check
No circularity: pure empirical benchmark with direct model evaluations
full rationale
The paper creates the SADU benchmark (154 diagrams, 2431 QA tasks) and reports accuracy numbers for 11 external VLMs (Gemini, Claude, GPT, Qwen families). No equations, derivations, fitted parameters, predictions, or self-citations are used to generate the central results; accuracies are measured outcomes on the curated test set. The work is self-contained against external models and does not reduce any claim to its own inputs by construction.
Forward citations
Cited by 2 Pith papers
-
(How) Do Large Language Models Understand High-Level Message Sequence Charts?
LLMs achieve only modest understanding of HMSC formal semantics at 52 percent accuracy, performing strongly on basic constructs but weakly on abstractions and traces.
-
(How) Do Large Language Models Understand High-Level Message Sequence Charts?
LLMs achieve only 52% overall accuracy on HMSC semantic tasks, performing well on basic concepts but poorly on abstractions, compositions, and trace calculations.
Reference graph
Works this paper leans on
-
[1]
https://ai.google.dev/gemini-api/docs/models/gemini-2.5-flash
-
[2]
https://ai.google.dev/gemini-api/docs/models/gemini-2.5-flash-lite
-
[3]
https://ai.google.dev/gemini-api/docs/models/gemini-3-flash-preview
-
[4]
https://ai.google.dev/gemini-api/docs/models/gemini-3.1-flash-lite-preview
-
[5]
https://aws.amazon.com/what-is/sdlc/#how-does-sdlc-work–f1ezvt
-
[6]
https://developers.openai.com/api/docs/models/gpt-4o-mini
-
[7]
https://developers.openai.com/api/docs/models/gpt-5-nano
-
[8]
https://developers.openai.com/api/docs/models/gpt-5.4
-
[9]
https://doi.org/10.5281/zenodo.19339991
-
[10]
https://github.com/matovaro/pyunml-dataset
-
[11]
https://learn.microsoft.com/en-us/azure/architecture/
-
[12]
https://mermaid.com/
-
[13]
https://qwen.ai/blog?id=qwen2.5-vl
-
[14]
https://www.anthropic.com/news/claude-haiku-4-5
-
[15]
https://www.anthropic.com/news/claude-sonnet-4-5
-
[16]
https://www.figma.com
-
[17]
https://www.lucidchart.com
-
[18]
https://www.miro.com
-
[19]
State of software architecture report - 2024, 2024
-
[20]
State of software architecture report — 2025, 2026
-
[21]
Fangwei Chen, Li Zhang, Xiaoli Lian, and Nan Niu. Automatically recognizing the semantic elements from UML class diagram images. Journal of Systems and Software, 193:111431, 2022
-
[22]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021
-
[23]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025
-
[24]
Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, et al. The danger of overthinking: Examining the reasoning-action dilemma in agentic tasks. arXiv preprint arXiv:2502.08235, 2025
-
[25]
Draft-ing Architectural Design Decisions Using LLMs
Rudra Dhar, Adyansh Kakran, Amey Karan, Karthik Vaidhyanathan, and Vasudeva Varma. Draft-ing architectural design decisions using llms. arXiv preprint arXiv:2504.08207, 2025
-
[26]
Do Vision-Language Models Really Understand Visual Language?
Yifan Hou, Buse Giledereli, Yilei Tu, and Mrinmaya Sachan. Do vision-language models really understand visual language? arXiv preprint arXiv:2410.00193, 2024
-
[27]
Yao Huang, Huanran Chen, Shouwei Ruan, Yichi Zhang, Xingxing Wei, and Yinpeng Dong. Mitigating overthinking in large reasoning models via manifold steering. arXiv preprint arXiv:2505.22411, 2025
-
[28]
James Ivers and Ipek Ozkaya. Will generative AI fill the automation gap in software architecting? In 2025 IEEE 22nd International Conference on Software Architecture Companion (ICSA-C), pages 41–45. IEEE, 2025
-
[29]
Yifan Ji, Zhipeng Xu, Zhenghao Liu, Zulong Chen, Qian Zhang, Zhibo Yang, Junyang Lin, Yu Gu, Ge Yu, and Maosong Sun. Unikie-bench: Benchmarking large multimodal models for key information extraction in visual documents. arXiv preprint arXiv:2602.07038, 2026
-
[30]
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023
-
[31]
Effects of defects in uml models: an experimental investigation
Christian FJ Lange and Michel RV Chaudron. Effects of defects in uml models: an experimental investigation. In Proceedings of the 28th International Conference on Software Engineering, pages 401–411, 2006
-
[32]
Devbench: A comprehensive benchmark for software development
Bowen Li, Wenhan Wu, Ziwei Tang, Lin Shi, John Yang, Jinyang Li, Shunyu Yao, Chen Qian, Binyuan Hui, Qicheng Zhang, et al. Devbench: A comprehensive benchmark for software development. arXiv preprint arXiv:2403.08604, 3, 2024
-
[33]
On the Perception Bottleneck of VLMs for Chart Understanding
Junteng Liu, Weihao Zeng, Xiwen Zhang, Yijun Wang, Zifei Shan, and Junxian He. On the perception bottleneck of vlms for chart understanding. arXiv preprint arXiv:2503.18435, 2025
-
[34]
Yujie Lu, Dongfu Jiang, Wenhu Chen, William Yang Wang, Yejin Choi, and Bill Yuchen Lin. Wildvision: Evaluating vision-language models in the wild with human preferences. Advances in Neural Information Processing Systems, 37:48224–48255, 2024
-
[35]
Argus: Vision-centric reasoning with grounded chain-of-thought
Yunze Man, De-An Huang, Guilin Liu, Shiwei Sheng, Shilong Liu, Liang-Yan Gui, Jan Kautz, Yu-Xiong Wang, and Zhiding Yu. Argus: Vision-centric reasoning with grounded chain-of-thought. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14268–14280, 2025
-
[36]
Docvlm: Make your vlm an efficient reader
Mor Shpigel Nacson, Aviad Aberdam, Roy Ganz, Elad Ben Avraham, Alona Golts, Yair Kittenplon, Shai Mazor, and Ron Litman. Docvlm: Make your vlm an efficient reader. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29005–29015, 2025
-
[37]
A survey into the rigor of uml use and its perceived impact on quality and productivity
Ariadi Nugroho and Michel RV Chaudron. A survey into the rigor of uml use and its perceived impact on quality and productivity. In Proceedings of the Second ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, pages 90–99, 2008
-
[38]
Dscodebench: A realistic benchmark for data science code generation
Shuyin Ouyang, Dong Huang, Jingwen Guo, Zeyu Sun, Qihao Zhu, and Jie M Zhang. Dscodebench: A realistic benchmark for data science code generation. arXiv preprint arXiv:2505.15621, 2025
-
[39]
Yanpeng Sun, Shan Zhang, Wei Tang, Aotian Chen, Piotr Koniusz, Kai Zou, Yuan Xue, and Anton van den Hengel. Math blind: Failures in diagram understanding undermine reasoning in mllms. arXiv preprint arXiv:2503.20745, 2025
-
[40]
From Charts to Code: A Hierarchical Benchmark for Multimodal Models
Jiahao Tang, Henry Hengyuan Zhao, Lijian Wu, Yifei Tao, Dongxing Mao, Yang Wan, Jingru Tan, Min Zeng, Min Li, and Alex Jinpeng Wang. From charts to code: A hierarchical benchmark for multimodal models. arXiv preprint arXiv:2510.17932, 2025
-
[41]
Weishi Wang, Hengchang Hu, Zhijie Zhang, Zhaochen Li, Hongxin Shao, and Daniel Dahlmeier. Document intelligence in the era of large language models: A survey. arXiv preprint arXiv:2510.13366, 2025
-
[42]
Testeval: Benchmarking large language models for test case generation
Wenhan Wang, Chenyuan Yang, Zhijie Wang, Yuheng Huang, Zhaoyang Chu, Da Song, Lingming Zhang, An Ran Chen, and Lei Ma. Testeval: Benchmarking large language models for test case generation. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 3547–3562, 2025
-
[43]
Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, et al. Charxiv: Charting gaps in realistic chart understanding in multimodal llms. Advances in Neural Information Processing Systems, 37:113569–113697, 2024
-
[44]
Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024
-
[45]
Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024
-
[46]
Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark
Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15134–15186, 2025