pith. machine review for the scientific record.

arxiv: 2604.23816 · v1 · submitted 2026-04-26 · 💻 cs.SE · cs.AI

Recognition: unknown

Query2Diagram: Answering Developer Queries with UML Diagrams

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 05:58 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords UML diagram generation · query-driven documentation · LLM fine-tuning · software reverse engineering · natural language to diagram · developer queries · code understanding · contextual documentation

The pith

Fine-tuned LLMs generate UML diagrams that answer specific developer queries about code with higher accuracy and fewer defects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that language models can be trained to create focused UML diagrams responding directly to natural language questions about a codebase, rather than producing exhaustive reverse-engineered views. This approach addresses the common problem of outdated or missing documentation by letting developers request only the relevant classes, relationships, and explanations they need at that moment. The authors curate a dataset pairing code snippets with queries and structured JSON diagram outputs, then fine-tune Qwen2.5-Coder-14B on the corrected examples. Their evaluation combines automatic checks for structural defects with human judgment of semantic fit, showing clear gains over untuned models. The work demonstrates that limited human curation can make intent-aware diagram generation practical at scale.
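The curation-and-fine-tuning recipe above implies a simple data format: one record per (code, query, diagram) triple. A minimal sketch, assuming an illustrative JSONL layout rather than the paper's released schema:

```python
import json

# One hypothetical training triple: code file, developer query, and the
# structured JSON diagram the model should emit. Field names are
# illustrative; the paper's actual schema may differ.
record = {
    "code": "class View:\n    def bind(self, bus):\n        bus.subscribe(self)",
    "query": "Which classes participate in event wiring?",
    "diagram": {
        "classes": [
            {"name": "View", "description": "Subscribes itself to the event bus."}
        ],
        "relationships": [],
    },
}

# Serialize as a single JSONL line, the usual format for fine-tuning corpora.
line = json.dumps(record)
print(json.loads(line)["query"])  # → Which classes participate in event wiring?
```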

Core claim

Fine-tuning an LLM on a modest collection of manually corrected triples—code files, natural-language developer queries, and corresponding UML diagrams encoded in structured JSON—yields models that produce diagrams with the highest F1 scores for relevant elements and defect rates below those of state-of-the-art untuned LLMs, while ensuring the output remains semantically aligned with the original query.
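One concrete reading of "F1 scores for relevant elements" treats a diagram as a set of element identifiers and scores its overlap with a reference diagram. A minimal sketch under that assumption (the paper's exact matching criteria may differ):

```python
def element_f1(predicted: set, reference: set) -> float:
    """Harmonic mean of precision and recall over diagram elements."""
    true_pos = len(predicted & reference)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(reference)
    return 2 * precision * recall / (precision + recall)

# Elements here are class names and (source, target, kind) relationship
# triples; this representation is an assumption for illustration.
pred = {"EventBus", "Listener", ("Listener", "EventBus", "association")}
ref = {"EventBus", "Listener", "View", ("Listener", "EventBus", "association")}
print(round(element_f1(pred, ref), 3))  # → 0.857
```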

What carries the argument

The query-driven UML generation process that maps a code file and a natural language query to a focused JSON diagram containing only pertinent classes, relationships, and contextual descriptions.
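What such a focused JSON diagram might contain, together with one plausible automatic structural check, can be sketched as follows; the schema, field names, and defect definition are assumptions, not the paper's specification:

```python
# Hypothetical focused diagram answering a query about event handling.
diagram = {
    "classes": [
        {"name": "View", "description": "Renders UI and registers listeners."},
        {"name": "EventBus", "description": "Dispatches events to listeners."},
    ],
    "relationships": [
        {"source": "View", "target": "EventBus", "kind": "dependency"},
    ],
}

def dangling_endpoints(d: dict) -> list:
    """One plausible structural-defect check: relationships whose
    endpoints are not declared as classes in the same diagram."""
    declared = {c["name"] for c in d.get("classes", [])}
    return [r for r in d.get("relationships", [])
            if r["source"] not in declared or r["target"] not in declared]

print(dangling_endpoints(diagram))  # → []
```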

If this is right

  • Developers gain on-demand access to precise visual documentation without manually updating or sifting through full diagrams.
  • Maintenance tasks become easier because queries about specific system aspects return only the relevant structural information.
  • The need for exhaustive reverse-engineering tools decreases when intent-aware generation is available.
  • Modest amounts of human-corrected data can produce reliable specialized models for software documentation tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same curation-plus-fine-tuning pattern could be applied to other visual artifacts such as sequence or component diagrams if comparable datasets are built.
  • Embedding the model inside an IDE would let developers receive diagrams in response to inline questions while editing code.
  • The approach underscores that domain-specific performance in software engineering often benefits more from targeted data cleaning than from larger general-purpose models.

Load-bearing premise

The small set of manually corrected training examples reflects the distribution of real developer questions and code structures the model will see after deployment.

What would settle it

Running the best fine-tuned model on a large collection of previously unseen real developer queries drawn from open-source projects and checking whether the generated diagrams maintain higher F1 scores and lower defect rates than baseline LLMs would directly test the claim.
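That settling experiment reduces to aggregating two statistics over an unseen query set. A toy sketch with stand-in scores (generation and scoring are elided; only the comparison logic is shown):

```python
def defect_rate(results):
    """Fraction of generated diagrams with at least one structural defect."""
    return sum(1 for r in results if r["defects"] > 0) / len(results)

def mean_f1(results):
    return sum(r["f1"] for r in results) / len(results)

# Stand-in per-query outputs for a fine-tuned model and a baseline;
# real values would come from the generation and evaluation pipeline.
fine_tuned = [{"f1": 0.90, "defects": 0}, {"f1": 0.80, "defects": 0},
              {"f1": 0.70, "defects": 1}, {"f1": 0.85, "defects": 0}]
baseline = [{"f1": 0.60, "defects": 2}, {"f1": 0.50, "defects": 0},
            {"f1": 0.55, "defects": 1}, {"f1": 0.65, "defects": 3}]

claim_holds = (mean_f1(fine_tuned) > mean_f1(baseline)
               and defect_rate(fine_tuned) < defect_rate(baseline))
print(claim_holds)  # → True
```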

Figures

Figures reproduced from arXiv: 2604.23816 by Anton M. Alekseev (2, 3), Oleg Baryshnikov (1), Sergey I. Nikolenko (2, 3) ((1) HSE University, (2) St. Petersburg Department of Steklov Mathematical Institute, RAS, (3) St. Petersburg State University).

Figure 1. A sample response to the user query “Map out the event listeners set up in the …”
Figure 2. Prompt template used to generate user queries (sampling, temperature 0.6).
Figure 3. Prompt template used to generate diagrams with base models (greedy).
Figure 4. A sample JSON graph and visualization rendered with PlantUML’s toolkit.
Figure 6. Sample generated diagrams. Our dual evaluation framework—combining automatic defect analysis with human relevance assessment—provides a robust methodology for assessing diagram quality beyond simple syntactic correctness. The results validate both research questions: LLMs can generate relevant graph structures from code (RQ1), and targeted training successfully controls diagram quality properties (RQ2). …
Figure 5. Prompt template used to generate diagrams with fine-tuned models (greedy).
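The JSON-to-PlantUML rendering shown in Figure 4 can be approximated by a small converter. The arrow mapping and field names below are assumptions, and PlantUML's actual grammar is much richer:

```python
def to_plantuml(diagram: dict) -> str:
    """Emit PlantUML class-diagram source from a JSON graph (sketch)."""
    arrows = {"dependency": "..>", "inheritance": "--|>"}
    lines = ["@startuml"]
    for cls in diagram["classes"]:
        lines.append(f'class {cls["name"]}')
    for rel in diagram["relationships"]:
        arrow = arrows.get(rel["kind"], "-->")  # default to plain association
        lines.append(f'{rel["source"]} {arrow} {rel["target"]}')
    lines.append("@enduml")
    return "\n".join(lines)

diagram = {
    "classes": [{"name": "View"}, {"name": "EventBus"}],
    "relationships": [{"source": "View", "target": "EventBus",
                       "kind": "dependency"}],
}
print(to_plantuml(diagram))
```

Feeding the resulting text to PlantUML's renderer would produce a class diagram with a dashed dependency arrow from View to EventBus.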
read the original abstract

Software documentation frequently becomes outdated or fails to exist entirely, yet developers need focused views of their codebase to understand complex systems. While automated reverse engineering tools can generate UML diagrams from code, they produce overwhelming detail without considering developer intent. We introduce query-driven UML diagram generation, where LLMs create diagrams that directly answer natural language questions about code. Unlike existing methods, our approach produces semantically focused diagrams containing only relevant elements with contextual descriptions. We fine-tune Qwen2.5-Coder-14B on a curated dataset of code files, developer queries, and corresponding diagram representations in a structured JSON format, evaluating with both automatic detection of structural defects and human assessment of semantic relevance. Results demonstrate that fine-tuning on a modest amount of manually corrected data yields dramatic improvements: our best model achieves the highest F1 scores while reducing defect rates below state-of-the-art LLMs, generating diagrams that are both structurally sound and semantically faithful to developer queries. Thus, we establish the feasibility of using LLMs for scalable contextual, on-demand documentation generation. We make our code and dataset publicly available at https://github.com/i-need-a-pencil/query2diagram.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Query2Diagram, a method for generating focused UML diagrams from natural language developer queries about codebases. It fine-tunes Qwen2.5-Coder-14B on a manually curated dataset of code files, queries, and structured JSON diagram representations, then evaluates using automatic structural defect detection and human semantic assessment. The central claim is that this yields dramatic improvements, with the best model achieving highest F1 scores, lower defect rates than state-of-the-art LLMs, and diagrams that are both structurally sound and semantically faithful; code and dataset are released publicly.

Significance. If the results hold under proper evaluation, the work demonstrates feasibility of scalable, query-driven documentation generation to address outdated or missing software docs. The public release of code and dataset is a clear strength, enabling reproducibility and extension by others.

major comments (2)
  1. [Abstract] The claims of 'dramatic improvements,' 'highest F1 scores,' and 'reducing defect rates below state-of-the-art LLMs' are unsupported by any quantitative results, dataset statistics (size, splits), exact metrics, baseline implementations, or significance tests. This absence is load-bearing for the empirical central claim.
  2. [Dataset construction and evaluation sections] No details are given on the number of training examples, query/codebase selection criteria, diversity of sources, or any distributional validation against real developer queries (e.g., from issue trackers). The generalization to 'semantically faithful' diagrams in deployment therefore rests on an unverified assumption about representativeness of the manually curated data.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'a modest amount of manually corrected data' is used without any indication of scale (e.g., example count), which would help readers assess the practicality of the fine-tuning approach.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and will revise the manuscript to incorporate additional details supporting the central claims.

read point-by-point responses
  1. Referee: [Abstract] The claims of 'dramatic improvements,' 'highest F1 scores,' and 'reducing defect rates below state-of-the-art LLMs' are unsupported by any quantitative results, dataset statistics (size, splits), exact metrics, baseline implementations, or significance tests. This absence is load-bearing for the empirical central claim.

    Authors: We agree that the abstract should be self-contained and include quantitative support. The current abstract uses qualitative phrasing without specific numbers. In revision we will add key results including the F1 scores, defect rates for our model versus baselines, dataset size and splits, and a brief mention of the evaluation approach. We will also report any statistical significance tests performed or note their omission as a limitation. revision: yes

  2. Referee: [Dataset construction and evaluation sections] No details are given on the number of training examples, query/codebase selection criteria, diversity of sources, or any distributional validation against real developer queries (e.g., from issue trackers). The generalization to 'semantically faithful' diagrams in deployment therefore rests on an unverified assumption about representativeness of the manually curated data.

    Authors: We acknowledge that the manuscript currently provides only high-level description of the curated dataset without exact counts or criteria. We will expand the relevant sections to specify the number of training examples, query and codebase selection process, sources used to promote diversity, and steps taken during manual curation. We will also add an explicit discussion of limitations, noting that formal distributional validation against issue-tracker queries was not performed and that generalization claims should be interpreted accordingly. revision: yes

Circularity Check

0 steps flagged

No circularity: standard fine-tuning and held-out evaluation

full rationale

The paper presents an empirical ML pipeline: fine-tune Qwen2.5-Coder-14B on a manually curated dataset of code-query-diagram triples, then evaluate with automatic structural defect detection and human semantic assessment. No equations, no fitted parameters renamed as predictions, no self-citation chains, and no uniqueness theorems are invoked. The central claims (highest F1, reduced defects, semantic faithfulness) rest on independent test-set performance rather than reducing to the training inputs by construction. This is the normal non-circular case for supervised learning papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that LLMs can reliably map natural-language queries to valid UML structures after modest fine-tuning, plus the practical assumption that a small manually corrected dataset suffices for generalization.

axioms (1)
  • domain assumption LLMs fine-tuned on structured JSON diagram representations will produce syntactically valid and semantically relevant UML diagrams for unseen queries
    Invoked in the abstract's description of fine-tuning and evaluation

pith-pipeline@v0.9.0 · 5547 in / 1120 out tokens · 53981 ms · 2026-05-08T05:58:16.148929+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

47 extracted references · 4 canonical work pages · 3 internal anchors

  1. [1]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, Sh. Anadkat, et al.,Gpt-4 technical report,https://arxiv. org/abs/2303.08774, 2023

  2. [2]

    Amalfitano, M

    D. Amalfitano, M. De Luca, T. Santilli, and P. Pelliccione,Automated software architec- ture design recovery from source code using llms, Software Architecture, Lecture Notes in Computer Science, vol. 15929, Springer, Cham, 2025, pp. 73–89

  3. [3]

    Arisholm, L

    E. Arisholm, L. C. Briand, S. E. Hove, and Y. Labiche,The impact of uml documentation on software maintenance: An experimental evaluation, IEEE Transactions on Software Engineering32(2006), no. 6, 365–381

  4. [4]

    Babaalla, A

    Z. Babaalla, A. Jakimi, and M. Oualla,Llm-driven mda pipeline for generating uml class diagrams and code, IEEE Access13(2025), 171266–171283

  5. [5]

    Battulga, L

    B. Battulga, L. Tsoodol, E. Dovdon, N. Bold, and O.-E. Namsrai,Metric-based defect prediction from class diagram, Array27(2025), 100438

  6. [6]

    M. Ben Chaaben,Software modeling assistance with large language models, Proceed- ings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems, Association for Computing Machinery, 2024, pp. 188–191

  7. [7]

    Boronat and J

    A. Boronat and J. Mustafa,Mdre-llm: A tool for analyzing and applying llms in soft- ware reverse engineering, Proceedings of the 2025 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE, 2025, pp. 850–854

  8. [8]

    C ´amara, J

    J. C ´amara, J. Troya, L. Burgue ˜no, and A. Vallecillo,On the assessment of generative ai in modeling tasks: an experience report with chatgpt and uml, Software and Systems Modeling22(2023), no. 3, 781–793

  9. [9]

    De Bari, G

    D. De Bari, G. Garaccione, R. Coppola, M. Torchiano, and L. Ardito,Evaluating large language models in exercises of uml class diagram modeling, Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (New York, NY, USA), ESEM ’24, Association for Computing Machinery, 2024, p. 393–399

  10. [10]

    DeepSeek-AI-Team,Deepseek-r1: Incentivizing reasoning capability in llms via rein- forcement learning,https://arxiv.org/abs/2501.12948, 2025. 17

  11. [11]

    Dettmers, A

    T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer,Qlora: Efficient finetuning of quantized llms, Advances in neural information processing systems36(2023), 10088–10115

  12. [12]

    Egyed,Automated abstraction of class diagrams, ACM Transactions on Software Engineering and Methodology (TOSEM)11(2002), no

    A. Egyed,Automated abstraction of class diagrams, ACM Transactions on Software Engineering and Methodology (TOSEM)11(2002), no. 4, 449–491

  13. [13]

    Elmers,Code and Comment Consistency Classification with Large Language Mod- els, Master’s thesis, Eindhoven University of Technology, Eindhoven, Netherlands, October 2023

    P. Elmers,Code and Comment Consistency Classification with Large Language Mod- els, Master’s thesis, Eindhoven University of Technology, Eindhoven, Netherlands, October 2023

  14. [14]

    Ferrari, S

    A. Ferrari, S. Abualhaijal, and Ch. Arora,Model generation with llms: From requirements to uml sequence diagrams, 2024 IEEE 32nd International Requirements Engineering Conference Workshops (REW), IEEE, 2024, pp. 291–300

  15. [15]

    Genero, M

    M. Genero, M. Piattini, and C. Calero,A survey of metrics for uml class diagrams, Journal of object technology4(2005), no. 9, 59–92

  16. [16]

    T. A. Ghaleb, M. A. Alturki, and Kh. Aljasser,Program comprehension through reverse- engineered sequence diagrams: A systematic review, Journal of Software: Evolution and Process30(2018), no. 11, e1965

  17. [17]

    Gu ´eh´eneuc,A reverse engineering tool for precise class diagrams, Proceedings of the 2004 conference of the Centre for Advanced Studies on Collaborative research, 2004, pp

    Y.-G. Gu ´eh´eneuc,A reverse engineering tool for precise class diagrams, Proceedings of the 2004 conference of the Centre for Advanced Studies on Collaborative research, 2004, pp. 28–41

  18. [18]

    Gu ´eh´eneuc, R

    Y.-G. Gu ´eh´eneuc, R. Douence, and N. Jussien,No Java without Caffeine: A tool for dynamic analysis of Java programs, Proceedings 17th IEEE International Conference on Automated Software Engineering„ IEEE, 2002, pp. 117–126

  19. [19]

    D. Han, M. Han, and Unsloth Team,Unsloth,http://github.com/unslothai/unsloth, 2023

  20. [20]

    Hebig, T

    R. Hebig, T. H. Quang, M. R. V. Chaudron, G. Robles, and M. A. Fernandez,The quest for open source projects that use uml: mining github, Proceedings of the ACM/IEEE 19th international conference on model driven engineering languages and systems, 2016, pp. 173–183

  21. [21]

    J. Hong, N. Lee, and J. Thorne,Orpo: Monolithic preference optimization without reference model, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 11170–11189

  22. [22]

    Hoque, P

    E. Hoque, P. Kavehzadeh, and A. Masry,Chart question answering: State of the art and future directions, Computer Graphics Forum41(2022), no. 3, 555–572

  23. [23]

    Jahan, M

    M. Jahan, M. M. Hassan, R. Golpayegani, G. Ranjbaran, Ch. Roy, B. Roy, and K. Schnei- der,Automated derivation of uml sequence diagrams from user stories: Unleashing the power of generative ai vs. a rule-based approach, Proceedings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems (New York, NY, USA), MODELS ’...

  24. [24]

    Jongeling, A

    R. Jongeling, A. Cicchetti, and F. Ciccozzi,How are informal diagrams used in software engineering? an exploratory study of open-source and industrial practices, Software and Systems Modeling24(2025), no. 3, 601–613. 18

  25. [25]

    V. Kan, M. P. Lnu, S. Berhe, C. El Kari, M. Maynard, and F. Khomh,Automated uml visualization of software ecosystems: Tracking versions, dependencies, and security updates, Procedia Computer Science, 8th International Conference on Emerging Data and Industry (EDI40), vol. 257, Elsevier, 2025, pp. 834–841

  26. [26]

    Khiati, Dj

    N. Khiati, Dj. Bouchiha, Y. Atig, and S. Boukli Hacene,Wa2ma: A model-driven approach for reengineering web applications into mobile applications, Edelweiss Applied Science and Technology9(2025), no. 6, 1530–1544

  27. [27]

    Kwon, Zh

    W. Kwon, Zh. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica,Efficient memory management for large language model serving with pagedattention, Proceedings of the 29th Symposium on Operating Systems Principles, 2023, pp. 611–626

  28. [28]

    Lomshakov and S

    V. Lomshakov and S. Nikolenko,Large language models for source code generation and editing, Zapiski Nauchnykh Seminarov POMI540(2024), 276–350

  29. [29]

    Nugroho and M

    A. Nugroho and M. R. V. Chaudron,A survey into the rigor of uml use and its perceived impact on quality and productivity, Proceedings of the Second ACM-IEEE international symposium on Empirical software engineering and measurement, 2008, pp. 90–99

  30. [30]

    ,Evaluating the impact of uml modeling on software quality: An industrial case study, ACM/IEEE International Conference on Model Driven Engineering Languages and Systems, 2009

  31. [31]

    ,The impact of uml modeling on defect density and defect resolution time in a proprietary system, Empirical Software Engineering19(2014), 926–954

  32. [32]

    Osman, A

    H. Osman, A. van Zadelhoff, and M. R. V. Chaudron,Uml class diagram simplification- a survey for improving reverse engineered class diagram comprehension, Interna- tional Conference on Model-Driven Engineering and Software Development, vol. 2, SCITEPRESS, 2013, pp. 291–296

  33. [33]

    Osman, A

    H. Osman, A. van Zadelhoff, Dave R Stikkolorum, and Michel RV Chaudron,Uml class diagram simplification: What is in the developer’s mind?, Proceedings of the second edition of the international workshop on experiences and empirical studies in software modelling, 2012, pp. 1–6

  34. [34]

    github.io/blog/qwq-32b-preview/, November 2024

    QwenTeam,Qwq: Reflect deeply on the boundaries of the unknown,https://qwenlm. github.io/blog/qwq-32b-preview/, November 2024

  35. [35]

    Shehata, B

    M. Shehata, B. Lepore, H. Cummings, and E. Parra,Creating uml class diagrams with general-purpose llms, 2024 IEEE Working Conference on Software Visualization (VISSOFT), IEEE, 2024, pp. 157–158

  36. [36]

    H. A. Siala and K. Lano,Leveraging llms for abstracting uml and ocl representations from java and python programs,https://papers.ssrn.com/sol3/papers.cfm?abstract_id= 5348203, 2025

  37. [37]

    ,Towards using llms in the reverse engineering of software systems to object con- straint language, Proceedings of the 2025 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE, 2025, pp. 885–890. 19

  38. [38]

    ,Using large language models to extract uml class diagrams from java programs, 8th International Conference on Software and System Engineering (ICoSSE 2025), IEEE, 2025, pp. 70–74

  39. [39]

    ˇStˇep´anek, D

    A. ˇStˇep´anek, D. Ku ˇt´ak, B. Kozl ´ıkov´a, and J. By ˇska,Helveg: Diagrams for software documentation, IEEE Transactions on Visualization and Computer Graphics31(2025), no. 10, 9079–9090

  40. [40]

    Sutton and J

    A. Sutton and J. I. Maletic,Recovering uml class models from c++: A detailed explanation, Information and Software Technology49(2007), no. 3, 212–229

  41. [41]

    Copilot Team,Microsoft Copilot: Your AI companion,https://copilot.microsoft.com/, 2023, Accessed: 2025-07-04

  42. [42]

    Cursor Team,Cursor — The AI Code Editor,https://cursor.com/, 2023, Accessed: 2025-07-04

  43. [43]

    Tonella and A

    P. Tonella and A. Potrich,Reverse engineering of the uml class diagram from c++ code in presence of weakly typed containers, Proceedings IEEE International Conference on Software Maintenance. ICSM 2001, IEEE, 2001, pp. 376–385

  44. [44]

    Unhelkar,Verification and validation for quality of uml 2.0 models, John Wiley & Sons, 2005

    Bh. Unhelkar,Verification and validation for quality of uml 2.0 models, John Wiley & Sons, 2005

  45. [45]

    B. T. Willard and R. Louf,Efficient guided generation for large language models,https: //arxiv.org/abs/2307.09702, 2023

  46. [46]

    A survey on large language models for software engineering,

    Q. Zhang, Ch. Fang, Y. Xie, Y. Zhang, Y. Yang, W. Sun, Sh. Yu, and Zh. Chen,A survey on large language models for software engineering,https://arxiv.org/abs/2312.15223, 2023

  47. [47]

    Zheng, R

    Y. Zheng, R. Zhang, J. Zhang, Y. Ye, Zh. Luo, Zh. Feng, and Y. Ma,Llamafactory: Unified efficient fine-tuning of 100+ language models, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) (Bangkok, Thailand), Association for Computational Linguistics, 2024. 20