Unlocking UML Class Diagram Understanding in Vision Language Models
Recognition: 1 theorem link · Lean Theorem
Pith reviewed 2026-05-13 01:36 UTC · model grok-4.3
The pith
Fine-tuning a vision language model on a custom UML class diagram dataset allows it to surpass a much larger general model on diagram-based questions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that LoRA-based fine-tuning on a newly constructed dataset of 16,000 image-question-answer triples for UML class diagrams enables a vision language model to outperform the Qwen 3.5 27B model on a dedicated benchmark for this task.
What carries the argument
The key mechanism is the construction of a large training dataset of 16,000 image-question-answer triples from UML class diagrams, which supports efficient LoRA fine-tuning to adapt general VLMs to this domain.
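The paper does not spell out how these triples are stored, but a minimal sketch of one plausible record layout and its loader is shown below; every field name here is an assumption for illustration, not the paper's schema.

```python
import json

# Hypothetical layout of one image-question-answer triple; the schema below
# is an illustrative assumption, not the paper's published format.
example_triple = {
    "image": "diagrams/library_system.png",            # rendered UML class diagram
    "question": "Which classes inherit from LibraryItem?",
    "answer": "Book, Magazine",
}

def load_triples(path: str) -> list[dict]:
    """Read a JSONL file containing one triple per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```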
If this is right
- The benchmark offers a standardized way to measure VLM performance specifically on UML class diagrams.
- LoRA fine-tuning provides an accessible method to boost diagram understanding without training from scratch (see the minimal sketch after this list).
- Performance gains on this dataset indicate that targeted training can overcome limitations in handling complex structured diagrams.
- Similar datasets could be built for other diagram types in computer science.
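To make the LoRA point above concrete, here is a minimal, self-contained sketch of the low-rank update from Hu et al. (2022): the frozen base weight is augmented by a trainable product of two small matrices. The layer sizes and rank are arbitrary illustrations, not the paper's training configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update (alpha / r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096), r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable LoRA parameters: {trainable}")  # 131,072 vs. ~16.8M frozen base weights
```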
Where Pith is reading between the lines
- This approach might integrate into software tools to automatically answer questions about code architecture from diagrams.
- The success suggests that many technical diagram types remain underexplored in current VLM capabilities.
- Testing the fine-tuned model on diagrams from diverse programming languages or styles would reveal the breadth of its learned understanding.
Load-bearing premise
The constructed dataset of 16,000 triples must be representative of real UML diagrams and the questions must demand actual understanding of the diagram structure rather than memorized patterns.
What would settle it
Running the fine-tuned model and the original Qwen 3.5 27B on a fresh collection of UML class diagrams from open-source projects and finding no performance advantage for the fine-tuned version.
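A head-to-head check of that kind could be a simple exact-match accuracy harness like the sketch below; `ask_finetuned` and `ask_qwen_baseline` are hypothetical placeholders for the two models' inference calls, and exact-match scoring is an assumption since the paper's metric is not described here.

```python
def accuracy(model_fn, test_set):
    """Fraction of questions answered exactly right on held-out diagrams."""
    correct = 0
    for item in test_set:
        prediction = model_fn(item["image"], item["question"])
        correct += prediction.strip().lower() == item["answer"].strip().lower()
    return correct / len(test_set)

# `fresh_diagrams` would hold triples built from open-source projects that
# neither model saw during training; the two model functions are placeholders.
# acc_ft = accuracy(ask_finetuned, fresh_diagrams)
# acc_base = accuracy(ask_qwen_baseline, fresh_diagrams)
# print(f"fine-tuned: {acc_ft:.1%}   Qwen 3.5 27B: {acc_base:.1%}")
```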
Original abstract
Although Vision Language Models (VLMs) have seen tremendous progress across all kinds of use cases, they still fall behind in answering questions regard-ing diagrams compared to photos. Although progress has been made in the area of bar charts, line charts and other diagrams like that there is still few research concerned with other types of diagrams, e.g. in the computer science domain. Our work presents a benchmark for visual question answering based on UML class diagrams which is both challenging and manageable. We further construct a large-scale training dataset with 16.000 image-question-answer triples and show that a LoRA-based finetune easily outperforms Qwen 3.5 27B, which is a recent and well-performing VLM in many other benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a benchmark for visual question answering on UML class diagrams, constructs a dataset of 16,000 image-question-answer triples, and reports that LoRA-based fine-tuning of a vision-language model outperforms the Qwen 3.5 27B model.
Significance. If the dataset proves representative of real UML diagrams, the questions require genuine diagram reasoning, and the evaluation uses held-out test data with proper controls, this could meaningfully advance VLM capabilities on technical CS diagrams where current models underperform. The focus on an understudied diagram type and the use of efficient LoRA fine-tuning are practical contributions. However, the absence of any metrics, generation details, or validation leaves the significance indeterminate.
major comments (2)
- [Abstract] The central claim that a LoRA-based finetune 'easily outperforms' Qwen 3.5 27B is unsupported by any reported metrics (e.g., accuracy, F1), statistical tests, baseline details, or results tables. This is load-bearing for the empirical contribution.
- [Abstract] No description is given of UML diagram sources, QA generation method (template-based, LLM-assisted, or human), train/test split, or quality controls for the 16k triples. Without these, it is impossible to determine whether outperformance reflects generalization or dataset artifacts/leakage.
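To make concrete what the second major comment is asking for, a template-based generation pipeline with a diagram-level train/test split might look like the sketch below. The diagram representation, question templates, and 90/10 ratio are illustrative assumptions, not details reported in the paper.

```python
import random

def generate_qa(diagram: dict) -> list[dict]:
    """Emit template-based QA pairs from a structured description of one diagram."""
    qa_pairs = []
    for cls in diagram["classes"]:
        if cls.get("parent"):
            qa_pairs.append({
                "image": diagram["image"],
                "question": f"Which class does {cls['name']} inherit from?",
                "answer": cls["parent"],
            })
        qa_pairs.append({
            "image": diagram["image"],
            "question": f"How many attributes does {cls['name']} have?",
            "answer": str(len(cls.get("attributes", []))),
        })
    return qa_pairs

def split_by_diagram(diagrams: list[dict], test_fraction: float = 0.1, seed: int = 0):
    """Split at the diagram level so no test image ever appears in training."""
    rng = random.Random(seed)
    shuffled = diagrams[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    train = [qa for d in shuffled[:cut] for qa in generate_qa(d)]
    test = [qa for d in shuffled[cut:] for qa in generate_qa(d)]
    return train, test
```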
minor comments (3)
- [Abstract] Typo: 'regard-ing' should be 'regarding'.
- [Abstract] Grammar: 'there is still few research' should be 'there is still little research'.
- [Abstract] The phrase 'LoRA-based finetune' is informal; use 'LoRA-based fine-tuning' for consistency with technical writing.
Simulated Author's Rebuttal
We appreciate the referee's careful reading and valuable feedback on our paper. The comments point to necessary clarifications in the abstract and additional details on the dataset. We will perform a major revision to incorporate these improvements, ensuring the manuscript provides sufficient information for assessing the work's contributions. We respond to each major comment below.
Point-by-point responses
- Referee: [Abstract] The central claim that a LoRA-based finetune 'easily outperforms' Qwen 3.5 27B is unsupported by any reported metrics (e.g., accuracy, F1), statistical tests, baseline details, or results tables. This is load-bearing for the empirical contribution.
  Authors: We concur that the abstract's claim requires supporting evidence to be credible. The revised abstract will explicitly state the performance metrics achieved by the LoRA fine-tuned model compared to Qwen 3.5 27B. Furthermore, we will add a comprehensive results section with tables, baseline details, and relevant statistical analysis to substantiate the outperformance claim. revision: yes
- Referee: [Abstract] No description is given of UML diagram sources, QA generation method (template-based, LLM-assisted, or human), train/test split, or quality controls for the 16k triples. Without these, it is impossible to determine whether outperformance reflects generalization or dataset artifacts/leakage.
  Authors: We agree that without these descriptions, the validity of the results cannot be fully evaluated. In the revised manuscript, we will include detailed information on the UML diagram sources, the QA pair generation methodology, the train/test split, and the quality control procedures applied to the 16,000 triples. This will help demonstrate that the observed improvements reflect genuine learning rather than artifacts. revision: yes
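On the promised statistical analysis, one standard test for comparing two models question-by-question on the same benchmark is McNemar's exact test on the discordant pairs. The sketch below assumes two aligned lists of per-item correctness flags; it illustrates the kind of test that could back the claim, not the authors' actual analysis.

```python
from scipy.stats import binom

def mcnemar_exact(finetuned_correct: list[bool], baseline_correct: list[bool]) -> float:
    """Exact McNemar p-value for a paired comparison of two models on the same questions."""
    # b: items only the fine-tuned model gets right; c: items only the baseline gets right
    b = sum(f and not g for f, g in zip(finetuned_correct, baseline_correct))
    c = sum(g and not f for f, g in zip(finetuned_correct, baseline_correct))
    if b + c == 0:
        return 1.0  # the two models agree on every item
    # Two-sided exact test on the discordant pairs under Binomial(b + c, 0.5)
    return min(1.0, 2 * binom.cdf(min(b, c), b + c, 0.5))

# Example with made-up correctness flags; a small p-value would indicate
# the accuracy gap is unlikely to be chance.
# p = mcnemar_exact([True, True, False, True], [True, False, False, False])
```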
Circularity Check
No circularity; empirical dataset and benchmark comparison
full rationale
The paper introduces a new UML class diagram VQA benchmark and a 16k image-QA training set, then reports that LoRA fine-tuning of an (unspecified) base VLM outperforms the external Qwen 3.5 27B model. No equations, self-definitional loops, fitted parameters renamed as predictions, uniqueness theorems, or ansatzes appear. The central claim is a direct empirical result on a newly constructed task against an independent baseline; it does not reduce to its own inputs by construction and is measured against external benchmarks rather than self-referential constructs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The generated dataset of 16,000 image-question-answer triples is of sufficient quality, diversity, and fidelity to real UML class diagrams to support meaningful generalization after LoRA fine-tuning.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Quoted passage: "We finetune Qwen 2.5 VL 7B ... on our dataset with ~16.000 question image pairs with LoRA ... Our best finetune ... achieves 85.9% accuracy outperforming ... Qwen 3.5 27B"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] A. Radford et al., "Learning transferable visual models from natural language supervision," International Conference on Machine Learning, pp. 8748–8763, 2021.
- [2] M. Tschannen et al., "SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features," Feb. 20, 2025, arXiv:2502.14786. doi: 10.48550/arXiv.2502.14786.
- [3] M. Caron et al., "Emerging properties in self-supervised vision transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9650–9660. Available: https://openaccess.thecvf.com/content/ICCV2021/html/Caron_Emerging_Properties_in_Self-Supervised_Vision_Transformers_ICCV_2021_paper
- [4] M. Hinck, M. L. Olson, D. Cobbley, S.-Y. Tseng, and V. Lal, "LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model," Jun. 10, 2024, arXiv:2404.01331. doi: 10.48550/arXiv.2404.01331.
- [5] A. Masry, X. L. Do, J. Q. Tan, S. Joty, and E. Hoque, "ChartQA: A benchmark for question answering about charts with visual and logical reasoning," in Findings of the Association for Computational Linguistics: ACL 2022, 2022, pp. 2263–2279. Available: https://aclanthology.org/2022.findings-acl.177/
- [6] A. Masry et al., "ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering," Apr. 10, 2025, arXiv:2504.05506. doi: 10.48550/arXiv.2504.05506.
- [7] H. G. Ranjani and R. Prabhudesai, "Measuring Visual Understanding in Telecom domain: Performance Metrics for Image-to-UML conversion using VLMs," in Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems, 2025, pp. 9–20. Available: https://aclanthology.org/2025.eval4nlp-1.2/
- [8] P. Deka and B. Devereux, "Structured Extraction from Business Process Diagrams Using Vision-Language Models," Nov. 27, 2025, arXiv:2511.22448. doi: 10.48550/arXiv.2511.22448.
- [9] Z. Babaalla, A. Jakimi, and M. Oualla, "LLM-Driven MDA Pipeline for Generating UML Class Diagrams and Code," IEEE Access, 2025. Available: https://ieeexplore.ieee.org/abstract/document/11184520/
- [10] A. Conrardy and J. Cabot, "From Image to UML: First Results of Image Based UML Diagram Generation Using LLMs," Apr. 17, 2024, arXiv:2404.11376. Available: http://arxiv.org/abs/2404.11376
- [11] D. De Bari, "Evaluating Large Language Models in Software Design: A Comparative Analysis of UML Class Diagram Generation," PhD thesis, Politecnico di Torino, 2024. Available: https://webthesis.biblio.polito.it/31177/
- [12] P. Giannouris and S. Ananiadou, "NOMAD: A Multi-Agent LLM System for UML Class Diagram Generation from Natural Language Requirements," Nov. 27, 2025, arXiv:2511.22409. doi: 10.48550/arXiv.2511.22409.
- [14] V.-V. Nguyen, H.-K. Nguyen, K.-S. Nguyen, H. Luong Thi Minh, T.-V. Nguyen, and D.-Q. Vu, "A Novel Pipeline for Automatic UML Sequence Diagram Synthesis and Multimodal Scoring," in Intelligent Systems and Data Science (Communications in Computer and Information Science, vol. 2713), N. Thai-Nghe, T.-N. Do, and S. Benferhat, Eds., Singapore...
- [15] R. Soldati, "LLM-based Generation and Evaluation of UML Class Diagrams," PhD thesis, Politecnico di Torino, 2025. Available: https://webthesis.biblio.polito.it/35962/
- [16] E. J. Hu et al., "LoRA: Low-rank adaptation of large language models," ICLR, vol. 1, no. 2, p. 3, 2022.
- [17] S. Mukhopadhyay, A. Qidwai, A. Garimella, P. Ramu, V. Gupta, and D. Roth, "Unraveling the truth: Do VLMs really understand charts? A deep dive into consistency and robustness," in Findings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 16696–16717. Available: https://aclanthology.org/2024...
- [18] M. Ahmadpour, A. Meighani, P. Taebi, O. Ghahroodi, A. Izadi, and M. S. Baghshah, "Limits and Gains of Test-Time Scaling in Vision-Language Reasoning," Dec. 11, 2025, arXiv:2512.11109. doi: 10.48550/arXiv.2512.11109.
- [19] R. Peinl and V. Tischler, "VLM@school: Evaluation of AI Image Understanding on German Middle School Knowledge," in Proceedings of the Future Technologies Conference (FTC) 2025, Volume 1 (Lecture Notes in Networks and Systems, vol. 1675), K. Arai, Ed., Cham: Springer Nature Switzerland, 2025, pp. 664–680. doi: 10.1007/978-3-032-07986-2_41.
- [20] S. Joshi et al., "MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation," Jan. 07, 2025, arXiv:2501.04155. doi: 10.48550/arXiv.2501.04155.