pith. machine review for the scientific record.

arxiv: 2605.11634 · v1 · submitted 2026-05-12 · 💻 cs.CV · cs.AI

Recognition: 1 theorem link

· Lean Theorem

Unlocking UML Class Diagram Understanding in Vision Language Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 01:36 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords UML class diagrams · vision language models · visual question answering · LoRA finetuning · dataset construction · model adaptation · software diagrams

The pith

Finetuning a vision language model on a custom UML class diagram dataset allows it to surpass a much larger general model on diagram-based questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to improve how vision language models handle questions about UML class diagrams, the structured drawings used in software design. Current models perform well on photos and simple charts but struggle with these technical diagrams. To address this, the authors build both a test benchmark and a training collection of 16,000 image-question-answer triples. They then show that a parameter-efficient LoRA finetune of a base model outperforms Qwen 3.5 27B, a capable untuned VLM, on diagram-based questions.
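The paper does not describe how its 16,000 triples were produced (a gap the referee report below flags), but one common recipe for diagram VQA datasets is template-based generation from a structured description of each diagram. The sketch below is purely illustrative: the diagram schema, question templates, and file name are all assumptions, not the authors' pipeline.

```python
# Hypothetical template-based triple generation for UML class diagram VQA.
# The schema and templates are illustrative assumptions, not the paper's method:
# each diagram is modeled as classes with attributes and an optional parent class.
diagram = {
    "classes": {
        "Shape": {"attributes": ["area"], "parent": None},
        "Circle": {"attributes": ["radius"], "parent": "Shape"},
        "Square": {"attributes": ["side"], "parent": "Shape"},
    }
}

def generate_triples(image_path, model):
    """Yield (image, question, answer) triples from a structured diagram description."""
    triples = []
    classes = model["classes"]
    # One global counting question per diagram.
    triples.append((image_path, "How many classes does the diagram contain?", str(len(classes))))
    for name, cls in classes.items():
        # Per-class attribute count and inheritance questions.
        triples.append((image_path, f"How many attributes does class {name} have?",
                        str(len(cls["attributes"]))))
        if cls["parent"] is not None:
            triples.append((image_path, f"Which class does {name} inherit from?", cls["parent"]))
    return triples

for _, question, answer in generate_triples("diagram_001.png", diagram):
    print(question, "->", answer)
```

Because the answers are derived from the same structure that renders the image, questions generated this way can only be answered correctly by reading the diagram, which is exactly the property the review's load-bearing premise asks for.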

Core claim

The central discovery is that a LoRA-based finetune on a newly constructed dataset of 16,000 image-question-answer triples for UML class diagrams enables a vision language model to outperform the Qwen 3.5 27B model on a dedicated benchmark for this task.

What carries the argument

The key mechanism is the creation of a large training dataset consisting of 16,000 image-question-answer triples from UML class diagrams, which supports efficient LoRA finetuning to adapt general VLMs to this domain.
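The LoRA mechanism the paper relies on (Hu et al., 2022, reference [16] below) can be sketched in a few lines: the pretrained weight matrix stays frozen, and only a low-rank update is trained. The dimensions and scaling below are illustrative choices, not values from the paper.

```python
import numpy as np

# Minimal sketch of the LoRA idea: keep the pretrained weight W frozen and train
# two small matrices A (r x d_in) and B (d_out x r), so the effective weight is
# W + (alpha / r) * B @ A. Dimensions here are toy values, not the paper's.
rng = np.random.default_rng(0)

d_in, d_out, r, alpha = 64, 64, 8, 16
W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-initialized

def lora_forward(x):
    # Base path plus scaled low-rank update.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialized to zero, the adapted model starts out identical to the base model.
assert np.allclose(lora_forward(x), W @ x)

# Only A and B are trained: r*(d_in + d_out) parameters vs d_in*d_out for a full finetune.
trainable = r * (d_in + d_out)
full = d_in * d_out
print(f"trainable fraction: {trainable / full:.3f}")  # 1024/4096 = 0.250 here
```

At realistic VLM scale the trainable fraction is far smaller than in this toy example, which is why a LoRA finetune on 16,000 triples is computationally accessible where full finetuning would not be.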

If this is right

  • The benchmark offers a standardized way to measure VLM performance specifically on UML class diagrams.
  • LoRA finetuning provides an accessible method to boost diagram understanding without training from scratch.
  • Performance gains on this dataset indicate that targeted training can overcome limitations in handling complex structured diagrams.
  • Similar datasets could be built for other diagram types in computer science.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach might integrate into software tools to automatically answer questions about code architecture from diagrams.
  • The success suggests that many technical diagram types remain underexplored in current VLM capabilities.
  • Testing the finetuned model on diagrams from diverse programming languages or styles would reveal the breadth of its learned understanding.

Load-bearing premise

The constructed dataset of 16,000 triples must be representative of real UML diagrams and the questions must demand actual understanding of the diagram structure rather than memorized patterns.

What would settle it

Running the finetuned model and the original Qwen 3.5 27B on a fresh collection of UML class diagrams from open-source projects and finding no performance advantage for the finetuned version.

read the original abstract

Although Vision Language Models (VLMs) have seen tremendous progress across all kinds of use cases, they still fall behind in answering questions regard-ing diagrams compared to photos. Although progress has been made in the area of bar charts, line charts and other diagrams like that there is still few research concerned with other types of diagrams, e.g. in the computer science domain. Our work presents a benchmark for visual question answering based on UML class diagrams which is both challenging and manageable. We further construct a large-scale training dataset with 16.000 image-question-answer triples and show that a LoRA-based finetune easily outperforms Qwen 3.5 27B, which is a recent and well-performing VLM in many other benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces a benchmark for visual question answering on UML class diagrams, constructs a dataset of 16,000 image-question-answer triples, and reports that LoRA-based fine-tuning of a vision-language model outperforms the Qwen 3.5 27B model.

Significance. If the dataset proves representative of real UML diagrams, the questions require genuine diagram reasoning, and the evaluation uses held-out test data with proper controls, this could meaningfully advance VLM capabilities on technical CS diagrams where current models underperform. The focus on an understudied diagram type and the use of efficient LoRA fine-tuning are practical contributions. However, the absence of any metrics, generation details, or validation leaves the significance indeterminate.

major comments (2)
  1. [Abstract] Abstract: The central claim that a LoRA-based finetune 'easily outperforms' Qwen 3.5 27B is unsupported by any reported metrics (e.g., accuracy, F1), statistical tests, baseline details, or results tables. This is load-bearing for the empirical contribution.
  2. [Abstract] Abstract: No description is given of UML diagram sources, QA generation method (template-based, LLM-assisted, or human), train/test split, or quality controls for the 16k triples. Without these, it is impossible to determine whether outperformance reflects generalization or dataset artifacts/leakage.
minor comments (3)
  1. [Abstract] Typo: 'regard-ing' should be 'regarding'.
  2. [Abstract] Grammar: 'there is still few research' should be 'there is still little research'.
  3. [Abstract] The phrase 'LoRA-based finetune' is informal; use 'LoRA-based fine-tuning' for consistency with technical writing.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's careful reading and valuable feedback on our paper. The comments point to necessary clarifications in the abstract and additional details on the dataset. We will perform a major revision to incorporate these improvements, ensuring the manuscript provides sufficient information for assessing the work's contributions. We respond to each major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that a LoRA-based finetune 'easily outperforms' Qwen 3.5 27B is unsupported by any reported metrics (e.g., accuracy, F1), statistical tests, baseline details, or results tables. This is load-bearing for the empirical contribution.

    Authors: We concur that the abstract's claim requires supporting evidence to be credible. The revised abstract will explicitly state the performance metrics achieved by the LoRA fine-tuned model compared to Qwen 3.5 27B. Furthermore, we will add a comprehensive results section with tables, baseline details, and any relevant statistical analysis to substantiate the outperformance claim. revision: yes

  2. Referee: [Abstract] Abstract: No description is given of UML diagram sources, QA generation method (template-based, LLM-assisted, or human), train/test split, or quality controls for the 16k triples. Without these, it is impossible to determine whether outperformance reflects generalization or dataset artifacts/leakage.

    Authors: We agree that without these descriptions, the validity of the results cannot be fully evaluated. In the revised manuscript, we will include detailed information on the UML diagram sources, the QA pair generation methodology, the train/test split, and the quality control procedures applied to the 16,000 triples. This will help demonstrate that the observed improvements are due to genuine learning rather than artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical dataset and benchmark comparison

full rationale

The paper introduces a new UML class diagram VQA benchmark and a 16k image-QA training set, then reports that LoRA finetuning of an (unspecified) base VLM outperforms the external Qwen 3.5 27B model. No equations, self-definitional loops, fitted parameters renamed as predictions, uniqueness theorems, or ansatzes appear. The central claim is a direct empirical result on a newly constructed task against an independent baseline; it does not reduce to its own inputs by construction. Self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unverified quality and representativeness of the newly constructed dataset; no new mathematical axioms or invented physical entities are introduced.

axioms (1)
  • domain assumption The generated dataset of 16,000 image-question-answer triples is of sufficient quality, diversity, and fidelity to real UML class diagrams to support meaningful generalization after LoRA fine-tuning.
    This assumption is required for the outperformance claim to be meaningful but receives no supporting evidence in the abstract.

pith-pipeline@v0.9.0 · 5418 in / 1336 out tokens · 52491 ms · 2026-05-13T01:36:25.408596+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 2 internal anchors

  1. [1]

Learning transferable visual models from natural language supervision,

    A. Radford et al., “Learning transferable visual models from natural language supervision,” International Conference on Machine Learning, pp. 8748–8763, 2021

  2. [2]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    M. Tschannen et al., “SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features,” Feb. 20, 2025, arXiv: arXiv:2502.14786. doi: 10.48550/arXiv.2502.14786

  3. [3]

    Emerging properties in self-supervised vision transformers,

    M. Caron et al., “Emerging properties in self-supervised vision transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9650–9660. Accessed: Jan. 26, 2026. [Online]. Available: https://openaccess.thecvf.com/content/ICCV2021/html/Caron_Emerging_Properties_in_Self-Supervised_Vision_Transformers_ICCV_2021_paper

  4. [4]

LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model,

    M. Hinck, M. L. Olson, D. Cobbley, S.-Y. Tseng, and V. Lal, “LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model,” Jun. 10, 2024, arXiv: arXiv:2404.01331. doi: 10.48550/arXiv.2404.01331

  5. [5]

    Chartqa: A benchmark for question answering about charts with visual and logical reasoning,

    A. Masry, X. L. Do, J. Q. Tan, S. Joty, and E. Hoque, “Chartqa: A benchmark for question answering about charts with visual and logical reasoning,” in Findings of the association for computational linguistics: ACL 2022, 2022, pp. 2263–2279. Accessed: Jan. 26, 2026. [Online]. Available: https://aclanthology.org/2022.findings-acl.177/

  6. [6]

ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering

    A. Masry et al., “ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering,” Apr. 10, 2025, arXiv: arXiv:2504.05506. doi: 10.48550/arXiv.2504.05506

  7. [7]

    Measuring Visual Understanding in Telecom domain: Performance Metrics for Image-to-UML conversion using VLMs,

    H. G. Ranjani and R. Prabhudesai, “Measuring Visual Understanding in Telecom domain: Performance Metrics for Image-to-UML conversion using VLMs,” in Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems, 2025, pp. 9–20. Accessed: Jan. 26, 2026. [Online]. Available: https://aclanthology.org/2025.eval4nlp-1.2/

  8. [8]

    Structured Extraction from Business Process Diagrams Using Vision-Language Models,

    P. Deka and B. Devereux, “Structured Extraction from Business Process Diagrams Using Vision-Language Models,” Nov. 27, 2025, arXiv: arXiv:2511.22448. doi: 10.48550/arXiv.2511.22448

  9. [9]

    LLM-Driven MDA Pipeline for Generating UML Class Diagrams and Code,

    Z. Babaalla, A. Jakimi, and M. Oualla, “LLM-Driven MDA Pipeline for Generating UML Class Diagrams and Code,” IEEE Access, 2025, Accessed: Jan. 26, 2026. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/11184520/

  10. [10]

    From Image to UML: First Results of Image Based UML Diagram Generation Using LLMs,

    A. Conrardy and J. Cabot, “From Image to UML: First Results of Image Based UML Diagram Generation Using LLMs,” Apr. 17, 2024, arXiv: arXiv:2404.11376. Accessed: Apr. 28, 2024. [Online]. Available: http://arxiv.org/abs/2404.11376

  11. [11]

    Evaluating Large Language Models in Software Design: A Comparative Analysis of UML Class Diagram Generation,

    D. De Bari, “Evaluating Large Language Models in Software Design: A Comparative Analysis of UML Class Diagram Generation,” PhD Thesis, Politecnico di Torino, 2024. Accessed: Apr. 28, 2024. [Online]. Available: https://webthesis.biblio.polito.it/31177/

  12. [12]

    NOMAD: A Multi-Agent LLM System for UML Class Diagram Generation from Natural Language Requirements

    P. Giannouris and S. Ananiadou, “NOMAD: A Multi-Agent LLM System for UML Class Diagram Generation from Natural Language Requirements,” Nov. 27, 2025, arXiv: arXiv:2511.22409. doi: 10.48550/arXiv.2511.22409

  13. [14]

A Novel Pipeline for Automatic UML Sequence Diagram Synthesis and Multimodal Scoring,

    V.-V. Nguyen, H.-K. Nguyen, K.-S. Nguyen, H. Luong Thi Minh, T.-V. Nguyen, and D.-Q. Vu, “A Novel Pipeline for Automatic UML Sequence Diagram Synthesis and Multimodal Scoring,” in Intelligent Systems and Data Science, vol. 2713, N. Thai-Nghe, T.-N. Do, and S. Benferhat, Eds., in Communications in Computer and Information Science, vol. 2713. Singapore...

  14. [15]

    LLM-based Generation and Evaluation of UML Class Diagrams,

    R. Soldati, “LLM-based Generation and Evaluation of UML Class Diagrams,” PhD Thesis, Politecnico di Torino, 2025. Accessed: Jan. 26, 2026. [Online]. Available: https://webthesis.biblio.polito.it/35962/

  15. [16]

LoRA: Low-Rank Adaptation of Large Language Models,

    E. J. Hu et al., “LoRA: Low-rank adaptation of large language models,” ICLR, vol. 1, no. 2, p. 3, 2022

  16. [17]

Unraveling the truth: Do VLMs really understand charts? A deep dive into consistency and robustness,

    S. Mukhopadhyay, A. Qidwai, A. Garimella, P. Ramu, V. Gupta, and D. Roth, “Unraveling the truth: Do VLMs really understand charts? A deep dive into consistency and robustness,” in Findings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 16696–16717. Accessed: Jan. 26, 2026. [Online]. Available: https://aclanthology.org/2024....

  17. [18]

    Limits and Gains of Test-Time Scaling in Vision-Language Reasoning,

    M. Ahmadpour, A. Meighani, P. Taebi, O. Ghahroodi, A. Izadi, and M. S. Baghshah, “Limits and Gains of Test-Time Scaling in Vision-Language Reasoning,” Dec. 11, 2025, arXiv: arXiv:2512.11109. doi: 10.48550/arXiv.2512.11109

  18. [19]

VLM@school: Evaluation of AI Image Understanding on German Middle School Knowledge,

    R. Peinl and V. Tischler, “VLM@school: Evaluation of AI Image Understanding on German Middle School Knowledge,” in Proceedings of the Future Technologies Conference (FTC) 2025, Volume 1, vol. 1675, K. Arai, Ed., in Lecture Notes in Networks and Systems, vol. 1675. Cham: Springer Nature Switzerland, 2025, pp. 664–680. doi: 10.1007/978-3-032-07986-2_41

  19. [20]

    MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation,

    S. Joshi et al., “MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation,” Jan. 07, 2025, arXiv: arXiv:2501.04155. doi: 10.48550/arXiv.2501.04155