Unlocking UML Class Diagram Understanding in Vision Language Models
Recognition: 1 theorem link · Lean Theorem
Pith reviewed 2026-05-13 01:36 UTC · model grok-4.3
The pith
Fine-tuning a vision language model on a custom UML class diagram dataset allows it to surpass a much larger general model on diagram-based questions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that LoRA-based fine-tuning on a newly constructed dataset of 16,000 image-question-answer triples for UML class diagrams enables a vision language model to outperform the Qwen 3.5 27B model on a dedicated benchmark for this task.
What carries the argument
The key mechanism is the construction of a large training dataset of 16,000 image-question-answer triples from UML class diagrams, which supports efficient LoRA fine-tuning to adapt general VLMs to this domain.
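The paper does not spell out how these triples are stored, but a minimal sketch of one plausible record layout and its loader is shown below; every field name here is an assumption for illustration, not the paper's schema.

```python
import json

# Hypothetical layout of one image-question-answer triple; the schema below
# is an illustrative assumption, not the paper's published format.
example_triple = {
    "image": "diagrams/library_system.png",            # rendered UML class diagram
    "question": "Which classes inherit from LibraryItem?",
    "answer": "Book, Magazine",
}

def load_triples(path: str) -> list[dict]:
    """Read a JSONL file containing one triple per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```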
If this is right
- The benchmark offers a standardized way to measure VLM performance specifically on UML class diagrams.
- LoRA fine-tuning provides an accessible method to boost diagram understanding without training from scratch (see the minimal sketch after this list).
- Performance gains on this dataset indicate that targeted training can overcome limitations in handling complex structured diagrams.
- Similar datasets could be built for other diagram types in computer science.
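To make the LoRA point above concrete, here is a minimal, self-contained sketch of the low-rank update from Hu et al. (2022): the frozen base weight is augmented by a trainable product of two small matrices. The layer sizes and rank are arbitrary illustrations, not the paper's training configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update (alpha / r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096), r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable LoRA parameters: {trainable}")  # 131,072 vs. ~16.8M frozen base weights
```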
Where Pith is reading between the lines
- This approach might integrate into software tools to automatically answer questions about code architecture from diagrams.
- The success suggests that many technical diagram types remain underexplored in current VLM capabilities.
- Testing the fine-tuned model on diagrams from diverse programming languages or styles would reveal the breadth of its learned understanding.
Load-bearing premise
The constructed dataset of 16,000 triples must be representative of real UML diagrams and the questions must demand actual understanding of the diagram structure rather than memorized patterns.
What would settle it
Running the fine-tuned model and the original Qwen 3.5 27B on a fresh collection of UML class diagrams from open-source projects and finding no performance advantage for the fine-tuned version.
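A head-to-head check of that kind could be a simple exact-match accuracy harness like the sketch below; `ask_finetuned` and `ask_qwen_baseline` are hypothetical placeholders for the two models' inference calls, and exact-match scoring is an assumption since the paper's metric is not described here.

```python
def accuracy(model_fn, test_set):
    """Fraction of questions answered exactly right on held-out diagrams."""
    correct = 0
    for item in test_set:
        prediction = model_fn(item["image"], item["question"])
        correct += prediction.strip().lower() == item["answer"].strip().lower()
    return correct / len(test_set)

# `fresh_diagrams` would hold triples built from open-source projects that
# neither model saw during training; the two model functions are placeholders.
# acc_ft = accuracy(ask_finetuned, fresh_diagrams)
# acc_base = accuracy(ask_qwen_baseline, fresh_diagrams)
# print(f"fine-tuned: {acc_ft:.1%}   Qwen 3.5 27B: {acc_base:.1%}")
```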
Original abstract
Although Vision Language Models (VLMs) have seen tremendous progress across all kinds of use cases, they still fall behind in answering questions regard-ing diagrams compared to photos. Although progress has been made in the area of bar charts, line charts and other diagrams like that there is still few research concerned with other types of diagrams, e.g. in the computer science domain. Our work presents a benchmark for visual question answering based on UML class diagrams which is both challenging and manageable. We further construct a large-scale training dataset with 16.000 image-question-answer triples and show that a LoRA-based finetune easily outperforms Qwen 3.5 27B, which is a recent and well-performing VLM in many other benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a benchmark for visual question answering on UML class diagrams, constructs a dataset of 16,000 image-question-answer triples, and reports that LoRA-based fine-tuning of a vision-language model outperforms the Qwen 3.5 27B model.
Significance. If the dataset proves representative of real UML diagrams, the questions require genuine diagram reasoning, and the evaluation uses held-out test data with proper controls, this could meaningfully advance VLM capabilities on technical CS diagrams where current models underperform. The focus on an understudied diagram type and the use of efficient LoRA fine-tuning are practical contributions. However, the absence of any metrics, generation details, or validation leaves the significance indeterminate.
major comments (2)
- [Abstract] The central claim that a LoRA-based finetune 'easily outperforms' Qwen 3.5 27B is unsupported by any reported metrics (e.g., accuracy, F1), statistical tests, baseline details, or results tables. This is load-bearing for the empirical contribution.
- [Abstract] No description is given of UML diagram sources, QA generation method (template-based, LLM-assisted, or human), train/test split, or quality controls for the 16k triples. Without these, it is impossible to determine whether outperformance reflects generalization or dataset artifacts/leakage.
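To make concrete what the second major comment is asking for, a template-based generation pipeline with a diagram-level train/test split might look like the sketch below. The diagram representation, question templates, and 90/10 ratio are illustrative assumptions, not details reported in the paper.

```python
import random

def generate_qa(diagram: dict) -> list[dict]:
    """Emit template-based QA pairs from a structured description of one diagram."""
    qa_pairs = []
    for cls in diagram["classes"]:
        if cls.get("parent"):
            qa_pairs.append({
                "image": diagram["image"],
                "question": f"Which class does {cls['name']} inherit from?",
                "answer": cls["parent"],
            })
        qa_pairs.append({
            "image": diagram["image"],
            "question": f"How many attributes does {cls['name']} have?",
            "answer": str(len(cls.get("attributes", []))),
        })
    return qa_pairs

def split_by_diagram(diagrams: list[dict], test_fraction: float = 0.1, seed: int = 0):
    """Split at the diagram level so no test image ever appears in training."""
    rng = random.Random(seed)
    shuffled = diagrams[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    train = [qa for d in shuffled[:cut] for qa in generate_qa(d)]
    test = [qa for d in shuffled[cut:] for qa in generate_qa(d)]
    return train, test
```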
minor comments (3)
- [Abstract] Typo: 'regard-ing' should be 'regarding'.
- [Abstract] Grammar: 'there is still few research' should be 'there is still little research'.
- [Abstract] The phrase 'LoRA-based finetune' is informal; use 'LoRA-based fine-tuning' for consistency with technical writing.
Simulated Author's Rebuttal
We appreciate the referee's careful reading and valuable feedback on our paper. The comments point to necessary clarifications in the abstract and additional details on the dataset. We will perform a major revision to incorporate these improvements, ensuring the manuscript provides sufficient information for assessing the work's contributions. We respond to each major comment below.
Point-by-point responses
- Referee: [Abstract] The central claim that a LoRA-based finetune 'easily outperforms' Qwen 3.5 27B is unsupported by any reported metrics (e.g., accuracy, F1), statistical tests, baseline details, or results tables. This is load-bearing for the empirical contribution.
  Authors: We concur that the abstract's claim requires supporting evidence to be credible. The revised abstract will explicitly state the performance metrics achieved by the LoRA fine-tuned model compared to Qwen 3.5 27B. Furthermore, we will add a comprehensive results section with tables, baseline details, and relevant statistical analysis to substantiate the outperformance claim. revision: yes
- Referee: [Abstract] No description is given of UML diagram sources, QA generation method (template-based, LLM-assisted, or human), train/test split, or quality controls for the 16k triples. Without these, it is impossible to determine whether outperformance reflects generalization or dataset artifacts/leakage.
  Authors: We agree that without these descriptions, the validity of the results cannot be fully evaluated. In the revised manuscript, we will include detailed information on the UML diagram sources, the QA pair generation methodology, the train/test split, and the quality control procedures applied to the 16,000 triples. This will help demonstrate that the observed improvements reflect genuine learning rather than artifacts. revision: yes
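On the promised statistical analysis, one standard test for comparing two models question-by-question on the same benchmark is McNemar's exact test on the discordant pairs. The sketch below assumes two aligned lists of per-item correctness flags; it illustrates the kind of test that could back the claim, not the authors' actual analysis.

```python
from scipy.stats import binom

def mcnemar_exact(finetuned_correct: list[bool], baseline_correct: list[bool]) -> float:
    """Exact McNemar p-value for a paired comparison of two models on the same questions."""
    # b: items only the fine-tuned model gets right; c: items only the baseline gets right
    b = sum(f and not g for f, g in zip(finetuned_correct, baseline_correct))
    c = sum(g and not f for f, g in zip(finetuned_correct, baseline_correct))
    if b + c == 0:
        return 1.0  # the two models agree on every item
    # Two-sided exact test on the discordant pairs under Binomial(b + c, 0.5)
    return min(1.0, 2 * binom.cdf(min(b, c), b + c, 0.5))

# Example with made-up correctness flags; a small p-value would indicate
# the accuracy gap is unlikely to be chance.
# p = mcnemar_exact([True, True, False, True], [True, False, False, False])
```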
Circularity Check
No circularity; empirical dataset and benchmark comparison
full rationale
The paper introduces a new UML class diagram VQA benchmark and a 16k image-QA training set, then reports that LoRA fine-tuning of an (unspecified) base VLM outperforms the external Qwen 3.5 27B model. No equations, self-definitional loops, fitted parameters renamed as predictions, uniqueness theorems, or ansatzes appear. The central claim is a direct empirical result on a newly constructed task against an independent baseline; it does not reduce to its own inputs by construction and is measured against external benchmarks rather than self-referential constructs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The generated dataset of 16,000 image-question-answer triples is of sufficient quality, diversity, and fidelity to real UML class diagrams to support meaningful generalization after LoRA fine-tuning.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Quoted passage: "We finetune Qwen 2.5 VL 7B ... on our dataset with ~16.000 question image pairs with LoRA ... Our best finetune ... achieves 85.9% accuracy outperforming ... Qwen 3.5 27B"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] A. Radford et al., "Learning transferable visual models from natural language supervision," International Conference on Machine Learning, pp. 8748–8763, 2021.
- [2] M. Tschannen et al., "SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features," Feb. 20, 2025, arXiv:2502.14786. doi: 10.48550/arXiv.2502.14786.
- [3] M. Caron et al., "Emerging properties in self-supervised vision transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9650–9660. Available: https://openaccess.thecvf.com/content/ICCV2021/html/Caron_Emerging_Properties_in_Self-Supervised_Vision_Transformers_ICCV_2021_paper
- [4] M. Hinck, M. L. Olson, D. Cobbley, S.-Y. Tseng, and V. Lal, "LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model," Jun. 10, 2024, arXiv:2404.01331. doi: 10.48550/arXiv.2404.01331.
- [5] A. Masry, X. L. Do, J. Q. Tan, S. Joty, and E. Hoque, "ChartQA: A benchmark for question answering about charts with visual and logical reasoning," in Findings of the Association for Computational Linguistics: ACL 2022, 2022, pp. 2263–2279. Available: https://aclanthology.org/2022.findings-acl.177/
- [6] A. Masry et al., "ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering," Apr. 10, 2025, arXiv:2504.05506. doi: 10.48550/arXiv.2504.05506.
- [7] H. G. Ranjani and R. Prabhudesai, "Measuring Visual Understanding in Telecom domain: Performance Metrics for Image-to-UML conversion using VLMs," in Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems, 2025, pp. 9–20. Available: https://aclanthology.org/2025.eval4nlp-1.2/
- [8] P. Deka and B. Devereux, "Structured Extraction from Business Process Diagrams Using Vision-Language Models," Nov. 27, 2025, arXiv:2511.22448. doi: 10.48550/arXiv.2511.22448.
- [9] Z. Babaalla, A. Jakimi, and M. Oualla, "LLM-Driven MDA Pipeline for Generating UML Class Diagrams and Code," IEEE Access, 2025. Available: https://ieeexplore.ieee.org/abstract/document/11184520/
- [10] A. Conrardy and J. Cabot, "From Image to UML: First Results of Image Based UML Diagram Generation Using LLMs," Apr. 17, 2024, arXiv:2404.11376. Available: http://arxiv.org/abs/2404.11376
- [11] D. De Bari, "Evaluating Large Language Models in Software Design: A Comparative Analysis of UML Class Diagram Generation," PhD thesis, Politecnico di Torino, 2024. Available: https://webthesis.biblio.polito.it/31177/
- [12] P. Giannouris and S. Ananiadou, "NOMAD: A Multi-Agent LLM System for UML Class Diagram Generation from Natural Language Requirements," Nov. 27, 2025, arXiv:2511.22409. doi: 10.48550/arXiv.2511.22409.
- [14] V.-V. Nguyen, H.-K. Nguyen, K.-S. Nguyen, H. Luong Thi Minh, T.-V. Nguyen, and D.-Q. Vu, "A Novel Pipeline for Automatic UML Sequence Diagram Synthesis and Multimodal Scoring," in Intelligent Systems and Data Science (Communications in Computer and Information Science, vol. 2713), N. Thai-Nghe, T.-N. Do, and S. Benferhat, Eds., Singapore...
- [15] R. Soldati, "LLM-based Generation and Evaluation of UML Class Diagrams," PhD thesis, Politecnico di Torino, 2025. Available: https://webthesis.biblio.polito.it/35962/
- [16] E. J. Hu et al., "LoRA: Low-rank adaptation of large language models," ICLR, vol. 1, no. 2, p. 3, 2022.
- [17] S. Mukhopadhyay, A. Qidwai, A. Garimella, P. Ramu, V. Gupta, and D. Roth, "Unraveling the truth: Do VLMs really understand charts? A deep dive into consistency and robustness," in Findings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 16696–16717. Available: https://aclanthology.org/2024...
- [18] M. Ahmadpour, A. Meighani, P. Taebi, O. Ghahroodi, A. Izadi, and M. S. Baghshah, "Limits and Gains of Test-Time Scaling in Vision-Language Reasoning," Dec. 11, 2025, arXiv:2512.11109. doi: 10.48550/arXiv.2512.11109.
- [19] R. Peinl and V. Tischler, "VLM@school: Evaluation of AI Image Understanding on German Middle School Knowledge," in Proceedings of the Future Technologies Conference (FTC) 2025, Volume 1 (Lecture Notes in Networks and Systems, vol. 1675), K. Arai, Ed., Cham: Springer Nature Switzerland, 2025, pp. 664–680. doi: 10.1007/978-3-032-07986-2_41.
- [20] S. Joshi et al., "MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation," Jan. 07, 2025, arXiv:2501.04155. doi: 10.48550/arXiv.2501.04155.