Benchmarking and Evaluating VLMs for Software Architecture Diagram Understanding
Pith reviewed 2026-05-13 17:11 UTC · model grok-4.3
The pith
Vision-language models top out at roughly 70 percent accuracy on software architecture diagram tasks in a new benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper presents SADU as a benchmark containing 154 carefully curated software architecture diagrams of behavioral, structural, and ER types, each paired with structured annotations and 2,431 question-answer tasks focused on counting and retrieval reasoning. Evaluation across 11 state-of-the-art VLMs from the Gemini, Claude, GPT, and Qwen families demonstrates that even the top performer, gemini-3-flash-preview, attains only 70.18 percent accuracy while gpt-4o-mini reaches just 17.77 percent, exposing clear limitations in diagram reasoning and visual relation grounding.
What carries the argument
The SADU benchmark of 154 diagrams and 2,431 counting and retrieval questions that probes VLMs on structured software engineering artifacts rather than generic images.
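The scoring protocol behind those accuracy numbers is easy to make concrete. A minimal sketch, assuming exact-match grading over per-question records (the field names and normalization rule are our illustration, not the paper's published schema):

```python
# Hypothetical SADU-style records; field names are illustrative, not the paper's schema.
tasks = [
    {"diagram": "class_diagram_01", "type": "counting",
     "question": "How many classes are shown?", "answer": "7"},
    {"diagram": "er_diagram_03", "type": "retrieval",
     "question": "Which entity does 'Order' reference?", "answer": "Customer"},
]

def normalize(s: str) -> str:
    # Case- and whitespace-insensitive comparison, a common exact-match convention.
    return " ".join(s.strip().lower().split())

def accuracy(predictions: dict, tasks: list) -> float:
    """Fraction of questions whose predicted answer exactly matches ground truth."""
    correct = sum(
        normalize(predictions.get(t["question"], "")) == normalize(t["answer"])
        for t in tasks
    )
    return correct / len(tasks)

preds = {"How many classes are shown?": "7",
         "Which entity does 'Order' reference?": "customer"}
print(accuracy(preds, tasks))  # 1.0
```

Under exact-match grading, a single benchmark-wide accuracy is all the headline numbers require; the per-type breakdowns discussed later stratify these same records.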
If this is right
- Software architecture diagram understanding remains challenging for all tested state-of-the-art VLMs.
- Models show particular weaknesses in visual relation grounding and structured diagram reasoning.
- SADU supplies a repeatable test for tracking improvements in diagram-aware AI systems.
- Current performance levels fall short of the reliability needed for faithful AI assistance in design-stage workflows.
- Progress on SADU tasks would directly support more consistent AI use across the software development lifecycle.
Where Pith is reading between the lines
- SADU-style questions could be added to VLM pre-training mixtures to reduce the domain gap between general images and engineering diagrams.
- Hybrid pipelines that combine diagram parsing with code analysis might compensate for the isolated weaknesses shown here.
- Extending the benchmark to include sequence or deployment diagrams would test whether the same relation-grounding shortfalls appear in other standard software notations.
- Low scores on ER diagrams suggest that data-modeling understanding may lag behind structural understanding in current VLMs.
Load-bearing premise
The 154 curated diagrams and 2,431 questions sufficiently represent the range of real-world software architecture diagrams and the reasoning skills needed by practicing engineers.
What would settle it
A VLM achieving over 90 percent accuracy on the full SADU set after training only on general image-text data would indicate that the observed limitations are not inherent to current architectures.
Original abstract
Software architecture diagrams are important design artifacts for communicating system structure, behavior, and data organization throughout the software development lifecycle. Although recent progress in large language models has substantially advanced code-centric software engineering tasks such as code generation, testing, and maintenance, the ability of modern vision-language models (VLMs) to understand software architecture diagrams remains underexplored. To address this gap, we present SADU, a benchmark for Software Architecture Diagram Understanding that evaluates VLMs on architecture diagrams as structured software engineering artifacts rather than generic images. SADU contains 154 carefully curated diagrams spanning behavioral, structural, and ER diagrams, paired with structured annotations and 2,431 question-answer tasks covering counting and retrieval reasoning. We evaluate 11 state-of-the-art VLMs from the Gemini, Claude, GPT, and Qwen families. Our results show that software architecture diagram understanding remains challenging for current models: the best-performing model gemini-3-flash-preview achieves only 70.18% accuracy, while gpt-4o-mini only achieves 17.77% accuracy. The results further reveal the weaknesses in diagram reasoning and visual relation grounding, highlighting a gap between current VLMs and the needs of design-stage software engineering. SADU provides a foundation for future research on diagram-aware AI systems and more faithful AI-assisted software engineering workflows.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SADU, a benchmark for Software Architecture Diagram Understanding consisting of 154 carefully curated diagrams spanning behavioral, structural, and ER types, paired with 2,431 question-answer tasks focused on counting and retrieval reasoning. It evaluates 11 state-of-the-art VLMs from the Gemini, Claude, GPT, and Qwen families, reporting that the best model (Gemini-3-flash-preview) reaches only 70.18% accuracy while GPT-4o-mini reaches 17.77%, and concludes that current VLMs exhibit weaknesses in diagram reasoning and visual relation grounding, highlighting a gap for design-stage software engineering.
Significance. If the benchmark holds, the work is significant for providing the first dedicated empirical evaluation of VLMs on software architecture diagrams as structured SE artifacts rather than generic images. The multi-model comparison across 11 systems with no post-hoc exclusions or parameter fitting supplies a reproducible baseline that can guide future diagram-aware AI development in software engineering.
Major comments (2)
- §3 (Dataset Construction): The claim that the 154 diagrams sufficiently represent real-world software architecture tasks rests on the description of them as 'carefully curated' spanning behavioral/structural/ER types, but the section provides no quantitative validation such as distributions of element counts, relation density, notation variants, or direct comparison to external corpora (e.g., open-source GitHub repositories). This directly affects the generalizability of the reported performance ceiling and the identified weaknesses in visual relation grounding.
- §5 (Results): The interpretation that low accuracies reveal 'weaknesses in diagram reasoning and visual relation grounding' is plausible from the aggregate numbers, but the section lacks a per-question-type or per-diagram-type error breakdown that would isolate whether failures stem from visual grounding, counting, or relation extraction; without this, the precise nature of the gap remains underspecified.
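The breakdown this comment asks for is a straightforward stratification of the same per-question results. A minimal sketch over hypothetical records (the fields and values are ours; the paper's result format is not given here):

```python
from collections import defaultdict

# Hypothetical per-question results; fields and values are illustrative only.
results = [
    {"q_type": "counting",  "d_type": "structural", "correct": True},
    {"q_type": "counting",  "d_type": "ER",         "correct": False},
    {"q_type": "retrieval", "d_type": "behavioral", "correct": True},
    {"q_type": "retrieval", "d_type": "ER",         "correct": False},
]

def breakdown(rows: list, key: str) -> dict:
    """Accuracy stratified by the given field (question type or diagram type)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in rows:
        totals[r[key]] += 1
        hits[r[key]] += r["correct"]
    return {k: hits[k] / totals[k] for k in totals}

print(breakdown(results, "q_type"))  # {'counting': 0.5, 'retrieval': 0.5}
print(breakdown(results, "d_type"))  # {'structural': 1.0, 'ER': 0.0, 'behavioral': 1.0}
```

The same stratification applied to real per-question logs would separate counting failures from retrieval failures without any additional annotation effort.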
Minor comments (2)
- Abstract and §4: The mention of 'structured annotations' should be expanded with a brief description of their format (e.g., whether they include explicit relation graphs or bounding boxes) to allow readers to assess how the 2,431 tasks were derived.
- Results tables: Adding 95% confidence intervals or standard deviations to the reported accuracy percentages would strengthen statistical interpretation of the model comparisons.
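Such intervals follow directly from the binomial nature of accuracy over a fixed question set. A sketch using the Wilson score interval, applied to the reported 70.18% over 2,431 questions (the interval itself is our illustration, not a figure from the paper):

```python
import math

def wilson_interval(correct: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# gemini-3-flash-preview: 70.18% of 2,431 questions is about 1,706 correct.
lo, hi = wilson_interval(1706, 2431)
print(f"[{lo:.3f}, {hi:.3f}]")  # roughly [0.683, 0.720]
```

With 2,431 questions the interval is under ±2 points, so the large gaps between model families reported here would remain clearly separated even after accounting for sampling noise.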
Simulated Author's Rebuttal
We thank the referee for the constructive comments and positive recommendation for minor revision. We address each major comment below and will incorporate the suggested improvements to strengthen the manuscript.
Point-by-point responses
-
Referee: §3 (Dataset Construction): The claim that the 154 diagrams sufficiently represent real-world software architecture tasks rests on the description of them as 'carefully curated' spanning behavioral/structural/ER types, but the section provides no quantitative validation such as distributions of element counts, relation density, notation variants, or direct comparison to external corpora (e.g., open-source GitHub repositories). This directly affects the generalizability of the reported performance ceiling and the identified weaknesses in visual relation grounding.
Authors: We agree that quantitative validation would improve the characterization of the benchmark's representativeness. In the revised manuscript, we will expand §3 with a new table and accompanying text providing summary statistics on element counts (nodes, edges, labels), relation densities, and notation variants across the 154 diagrams. Where feasible, we will also include a brief comparison to a sampled set of architecture diagrams from public GitHub repositories to support generalizability claims. revision: yes
-
Referee: §5 (Results): The interpretation that low accuracies reveal 'weaknesses in diagram reasoning and visual relation grounding' is plausible from the aggregate numbers, but the section lacks a per-question-type or per-diagram-type error breakdown that would isolate whether failures stem from visual grounding, counting, or relation extraction; without this, the precise nature of the gap remains underspecified.
Authors: We agree that a finer-grained error analysis would better isolate the sources of model failures. In the revision, we will augment §5 with per-question-type accuracy breakdowns (counting vs. retrieval) and per-diagram-type results (behavioral, structural, ER). We will also add a short qualitative subsection with representative error examples to clarify whether issues arise primarily from visual grounding, counting, or relation extraction. revision: yes
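The summary statistics the authors commit to for §3 (element counts, relation density) are mechanical once each diagram carries node and edge annotations. A sketch over hypothetical annotation records (the numbers and field names are illustrative, not the paper's data):

```python
import statistics

# Hypothetical diagram annotations; the paper's actual annotation format is not specified here.
diagrams = [
    {"type": "structural", "nodes": 12, "edges": 18},
    {"type": "behavioral", "nodes": 8,  "edges": 9},
    {"type": "ER",         "nodes": 6,  "edges": 7},
]

for d in diagrams:
    # Relation density: edges per node, one simple measure of diagram complexity.
    d["density"] = d["edges"] / d["nodes"]

print("mean nodes:", statistics.mean(d["nodes"] for d in diagrams))
print("mean density:", round(statistics.mean(d["density"] for d in diagrams), 3))
```

Reporting these distributions per diagram type, alongside a sample from external corpora, is what would let readers judge the representativeness concern raised in the first major comment.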
Circularity Check
No circularity: pure empirical benchmark with direct model evaluations
full rationale
The paper creates the SADU benchmark (154 diagrams, 2431 QA tasks) and reports accuracy numbers for 11 external VLMs (Gemini, Claude, GPT, Qwen families). No equations, derivations, fitted parameters, predictions, or self-citations are used to generate the central results; accuracies are measured outcomes on the curated test set. The work is self-contained against external models and does not reduce any claim to its own inputs by construction.
Forward citations
Cited by 2 Pith papers
-
(How) Do Large Language Models Understand High-Level Message Sequence Charts?
LLMs achieve only modest understanding of HMSC formal semantics at 52 percent accuracy, performing strongly on basic constructs but weakly on abstractions and traces.
-
(How) Do Large Language Models Understand High-Level Message Sequence Charts?
LLMs achieve only 52% overall accuracy on HMSC semantic tasks, performing well on basic concepts but poorly on abstractions, compositions, and trace calculations.
Reference graph
Works this paper leans on
-
[1]
https://ai.google.dev/gemini-api/docs/models/gemini-2.5-flash
-
[2]
https://ai.google.dev/gemini-api/docs/models/gemini-2.5-flash-lite
-
[3]
https://ai.google.dev/gemini-api/docs/models/gemini-3-flash-preview
-
[4]
https://ai.google.dev/gemini-api/docs/models/gemini-3.1-flash-lite-preview
-
[5]
https://aws.amazon.com/what-is/sdlc/#how-does-sdlc-work–f1ezvt
-
[6]
https://developers.openai.com/api/docs/models/gpt-4o-mini
-
[7]
https://developers.openai.com/api/docs/models/gpt-5-nano
-
[8]
https://developers.openai.com/api/docs/models/gpt-5.4
-
[9]
https://doi.org/10.5281/zenodo.19339991
-
[10]
https://github.com/matovaro/pyunml-dataset
-
[11]
https://learn.microsoft.com/en-us/azure/architecture/
-
[12]
https://mermaid.com/
-
[13]
https://qwen.ai/blog?id=qwen2.5-vl
-
[14]
https://www.anthropic.com/news/claude-haiku-4-5
-
[15]
https://www.anthropic.com/news/claude-sonnet-4-5
-
[16]
https://www.figma.com
-
[17]
https://www.lucidchart.com
-
[18]
https://www.miro.com
-
[19]
State of software architecture report - 2024, 2024
-
[20]
State of software architecture report — 2025, 2026
-
[21]
Fangwei Chen, Li Zhang, Xiaoli Lian, and Nan Niu. Automatically recognizing the semantic elements from UML class diagram images. Journal of Systems and Software, 193:111431, 2022
-
[22]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021
-
[23]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025
-
[24]
Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, et al. The danger of overthinking: Examining the reasoning-action dilemma in agentic tasks. arXiv preprint arXiv:2502.08235, 2025
-
[25]
Draft-ing Architectural Design Decisions Using LLMs
Rudra Dhar, Adyansh Kakran, Amey Karan, Karthik Vaidhyanathan, and Vasudeva Varma. Draft-ing architectural design decisions using llms. arXiv preprint arXiv:2504.08207, 2025
-
[26]
Do Vision-Language Models Really Understand Visual Language?
Yifan Hou, Buse Giledereli, Yilei Tu, and Mrinmaya Sachan. Do vision-language models really understand visual language? arXiv preprint arXiv:2410.00193, 2024
-
[27]
Yao Huang, Huanran Chen, Shouwei Ruan, Yichi Zhang, Xingxing Wei, and Yinpeng Dong. Mitigating overthinking in large reasoning models via manifold steering. arXiv preprint arXiv:2505.22411, 2025
-
[28]
James Ivers and Ipek Ozkaya. Will generative AI fill the automation gap in software architecting? In 2025 IEEE 22nd International Conference on Software Architecture Companion (ICSA-C), pages 41–45. IEEE, 2025
-
[29]
Yifan Ji, Zhipeng Xu, Zhenghao Liu, Zulong Chen, Qian Zhang, Zhibo Yang, Junyang Lin, Yu Gu, Ge Yu, and Maosong Sun. Unikie-bench: Benchmarking large multimodal models for key information extraction in visual documents. arXiv preprint arXiv:2602.07038, 2026
-
[30]
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023
-
[31]
Effects of defects in uml models: an experimental investigation
Christian FJ Lange and Michel RV Chaudron. Effects of defects in uml models: an experimental investigation. In Proceedings of the 28th International Conference on Software Engineering, pages 401–411, 2006
-
[32]
Devbench: A comprehensive benchmark for software development
Bowen Li, Wenhan Wu, Ziwei Tang, Lin Shi, John Yang, Jinyang Li, Shunyu Yao, Chen Qian, Binyuan Hui, Qicheng Zhang, et al. Devbench: A comprehensive benchmark for software development. arXiv preprint arXiv:2403.08604, 3, 2024
-
[33]
On the Perception Bottleneck of VLMs for Chart Understanding
Junteng Liu, Weihao Zeng, Xiwen Zhang, Yijun Wang, Zifei Shan, and Junxian He. On the perception bottleneck of vlms for chart understanding. arXiv preprint arXiv:2503.18435, 2025
-
[34]
Yujie Lu, Dongfu Jiang, Wenhu Chen, William Yang Wang, Yejin Choi, and Bill Yuchen Lin. Wildvision: Evaluating vision-language models in the wild with human preferences. Advances in Neural Information Processing Systems, 37:48224–48255, 2024
-
[35]
Argus: Vision-centric reasoning with grounded chain-of-thought
Yunze Man, De-An Huang, Guilin Liu, Shiwei Sheng, Shilong Liu, Liang-Yan Gui, Jan Kautz, Yu-Xiong Wang, and Zhiding Yu. Argus: Vision-centric reasoning with grounded chain-of-thought. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14268–14280, 2025
-
[36]
Docvlm: Make your vlm an efficient reader
Mor Shpigel Nacson, Aviad Aberdam, Roy Ganz, Elad Ben Avraham, Alona Golts, Yair Kittenplon, Shai Mazor, and Ron Litman. Docvlm: Make your vlm an efficient reader. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29005–29015, 2025
-
[37]
A survey into the rigor of uml use and its perceived impact on quality and productivity
Ariadi Nugroho and Michel RV Chaudron. A survey into the rigor of uml use and its perceived impact on quality and productivity. In Proceedings of the Second ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, pages 90–99, 2008
-
[38]
Dscodebench: A realistic benchmark for data science code generation
Shuyin Ouyang, Dong Huang, Jingwen Guo, Zeyu Sun, Qihao Zhu, and Jie M Zhang. Dscodebench: A realistic benchmark for data science code generation. arXiv preprint arXiv:2505.15621, 2025
-
[39]
Yanpeng Sun, Shan Zhang, Wei Tang, Aotian Chen, Piotr Koniusz, Kai Zou, Yuan Xue, and Anton van den Hengel. Math blind: Failures in diagram understanding undermine reasoning in mllms. arXiv preprint arXiv:2503.20745, 2025
-
[40]
From Charts to Code: A Hierarchical Benchmark for Multimodal Models
Jiahao Tang, Henry Hengyuan Zhao, Lijian Wu, Yifei Tao, Dongxing Mao, Yang Wan, Jingru Tan, Min Zeng, Min Li, and Alex Jinpeng Wang. From charts to code: A hierarchical benchmark for multimodal models. arXiv preprint arXiv:2510.17932, 2025
-
[41]
Weishi Wang, Hengchang Hu, Zhijie Zhang, Zhaochen Li, Hongxin Shao, and Daniel Dahlmeier. Document intelligence in the era of large language models: A survey. arXiv preprint arXiv:2510.13366, 2025
-
[42]
Testeval: Benchmarking large language models for test case generation
Wenhan Wang, Chenyuan Yang, Zhijie Wang, Yuheng Huang, Zhaoyang Chu, Da Song, Lingming Zhang, An Ran Chen, and Lei Ma. Testeval: Benchmarking large language models for test case generation. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 3547–3562, 2025
-
[43]
Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, et al. Charxiv: Charting gaps in realistic chart understanding in multimodal llms. Advances in Neural Information Processing Systems, 37:113569–113697, 2024
-
[44]
Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024
-
[45]
Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024
-
[46]
Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark
Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15134–15186, 2025