pith. machine review for the scientific record.

arxiv: 2604.23816 · v1 · submitted 2026-04-26 · 💻 cs.SE · cs.AI

Recognition: unknown

Query2Diagram: Answering Developer Queries with UML Diagrams

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 05:58 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords UML diagram generation · query-driven documentation · LLM fine-tuning · software reverse engineering · natural language to diagram · developer queries · code understanding · contextual documentation

The pith

Fine-tuned LLMs generate UML diagrams that answer specific developer queries about code with higher accuracy and fewer defects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that language models can be trained to create focused UML diagrams responding directly to natural language questions about a codebase, rather than producing exhaustive reverse-engineered views. This approach addresses the common problem of outdated or missing documentation by letting developers request only the relevant classes, relationships, and explanations they need at that moment. The authors curate a dataset pairing code snippets with queries and structured JSON diagram outputs, then fine-tune Qwen2.5-Coder-14B on the corrected examples. Their evaluation combines automatic checks for structural defects with human judgment of semantic fit, showing clear gains over untuned models. The work demonstrates that limited human curation can make intent-aware diagram generation practical at scale.
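The curation-and-fine-tuning recipe above implies a simple data format: one record per (code, query, diagram) triple. A minimal sketch, assuming an illustrative JSONL layout rather than the paper's released schema:

```python
import json

# One hypothetical training triple: code file, developer query, and the
# structured JSON diagram the model should emit. Field names are
# illustrative; the paper's actual schema may differ.
record = {
    "code": "class View:\n    def bind(self, bus):\n        bus.subscribe(self)",
    "query": "Which classes participate in event wiring?",
    "diagram": {
        "classes": [
            {"name": "View", "description": "Subscribes itself to the event bus."}
        ],
        "relationships": [],
    },
}

# Serialize as a single JSONL line, the usual format for fine-tuning corpora.
line = json.dumps(record)
print(json.loads(line)["query"])  # → Which classes participate in event wiring?
```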

Core claim

Fine-tuning an LLM on a modest collection of manually corrected triples—code files, natural-language developer queries, and corresponding UML diagrams encoded in structured JSON—yields models that produce diagrams with the highest F1 scores for relevant elements and defect rates below those of state-of-the-art untuned LLMs, while ensuring the output remains semantically aligned with the original query.
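One concrete reading of "F1 scores for relevant elements" treats a diagram as a set of element identifiers and scores its overlap with a reference diagram. A minimal sketch under that assumption (the paper's exact matching criteria may differ):

```python
def element_f1(predicted: set, reference: set) -> float:
    """Harmonic mean of precision and recall over diagram elements."""
    true_pos = len(predicted & reference)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(reference)
    return 2 * precision * recall / (precision + recall)

# Elements here are class names and (source, target, kind) relationship
# triples; this representation is an assumption for illustration.
pred = {"EventBus", "Listener", ("Listener", "EventBus", "association")}
ref = {"EventBus", "Listener", "View", ("Listener", "EventBus", "association")}
print(round(element_f1(pred, ref), 3))  # → 0.857
```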

What carries the argument

The query-driven UML generation process that maps a code file and a natural language query to a focused JSON diagram containing only pertinent classes, relationships, and contextual descriptions.
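What such a focused JSON diagram might contain, together with one plausible automatic structural check, can be sketched as follows; the schema, field names, and defect definition are assumptions, not the paper's specification:

```python
# Hypothetical focused diagram answering a query about event handling.
diagram = {
    "classes": [
        {"name": "View", "description": "Renders UI and registers listeners."},
        {"name": "EventBus", "description": "Dispatches events to listeners."},
    ],
    "relationships": [
        {"source": "View", "target": "EventBus", "kind": "dependency"},
    ],
}

def dangling_endpoints(d: dict) -> list:
    """One plausible structural-defect check: relationships whose
    endpoints are not declared as classes in the same diagram."""
    declared = {c["name"] for c in d.get("classes", [])}
    return [r for r in d.get("relationships", [])
            if r["source"] not in declared or r["target"] not in declared]

print(dangling_endpoints(diagram))  # → []
```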

If this is right

  • Developers gain on-demand access to precise visual documentation without manually updating or sifting through full diagrams.
  • Maintenance tasks become easier because queries about specific system aspects return only the relevant structural information.
  • The need for exhaustive reverse-engineering tools decreases when intent-aware generation is available.
  • Modest amounts of human-corrected data can produce reliable specialized models for software documentation tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same curation-plus-fine-tuning pattern could be applied to other visual artifacts such as sequence or component diagrams if comparable datasets are built.
  • Embedding the model inside an IDE would let developers receive diagrams in response to inline questions while editing code.
  • The approach underscores that domain-specific performance in software engineering often benefits more from targeted data cleaning than from larger general-purpose models.

Load-bearing premise

The small set of manually corrected training examples reflects the distribution of real developer questions and code structures the model will see after deployment.

What would settle it

Running the best fine-tuned model on a large collection of previously unseen real developer queries drawn from open-source projects and checking whether the generated diagrams maintain higher F1 scores and lower defect rates than baseline LLMs would directly test the claim.
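That settling experiment reduces to aggregating two statistics over an unseen query set. A toy sketch with stand-in scores (generation and scoring are elided; only the comparison logic is shown):

```python
def defect_rate(results):
    """Fraction of generated diagrams with at least one structural defect."""
    return sum(1 for r in results if r["defects"] > 0) / len(results)

def mean_f1(results):
    return sum(r["f1"] for r in results) / len(results)

# Stand-in per-query outputs for a fine-tuned model and a baseline;
# real values would come from the generation and evaluation pipeline.
fine_tuned = [{"f1": 0.90, "defects": 0}, {"f1": 0.80, "defects": 0},
              {"f1": 0.70, "defects": 1}, {"f1": 0.85, "defects": 0}]
baseline = [{"f1": 0.60, "defects": 2}, {"f1": 0.50, "defects": 0},
            {"f1": 0.55, "defects": 1}, {"f1": 0.65, "defects": 3}]

claim_holds = (mean_f1(fine_tuned) > mean_f1(baseline)
               and defect_rate(fine_tuned) < defect_rate(baseline))
print(claim_holds)  # → True
```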

Figures

Figures reproduced from arXiv: 2604.23816 by Anton M. Alekseev (2, 3), Oleg Baryshnikov (1), Sergey I. Nikolenko (2, 3) ((1) HSE University, (2) St. Petersburg Department of Steklov Mathematical Institute, RAS, (3) St. Petersburg State University).

Figure 1. A sample response to the user query “Map out the event listeners set up in the …”
Figure 2. Prompt template used to generate user queries (sampling, temperature 0.6).
Figure 3. Prompt template used to generate diagrams with base models (greedy).
Figure 4. A sample JSON graph and visualization rendered with PlantUML’s toolkit.
Figure 6. Sample generated diagrams. Our dual evaluation framework—combining automatic defect analysis with human relevance assessment—provides a robust methodology for assessing diagram quality beyond simple syntactic correctness. The results validate both research questions: LLMs can generate relevant graph structures from code (RQ1), and targeted training successfully controls diagram quality properties (RQ2). …
Figure 5. Prompt template used to generate diagrams with fine-tuned models (greedy).
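The JSON-to-PlantUML rendering shown in Figure 4 can be approximated by a small converter. The arrow mapping and field names below are assumptions, and PlantUML's actual grammar is much richer:

```python
def to_plantuml(diagram: dict) -> str:
    """Emit PlantUML class-diagram source from a JSON graph (sketch)."""
    arrows = {"dependency": "..>", "inheritance": "--|>"}
    lines = ["@startuml"]
    for cls in diagram["classes"]:
        lines.append(f'class {cls["name"]}')
    for rel in diagram["relationships"]:
        arrow = arrows.get(rel["kind"], "-->")  # default to plain association
        lines.append(f'{rel["source"]} {arrow} {rel["target"]}')
    lines.append("@enduml")
    return "\n".join(lines)

diagram = {
    "classes": [{"name": "View"}, {"name": "EventBus"}],
    "relationships": [{"source": "View", "target": "EventBus",
                       "kind": "dependency"}],
}
print(to_plantuml(diagram))
```

Feeding the resulting text to PlantUML's renderer would produce a class diagram with a dashed dependency arrow from View to EventBus.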
read the original abstract

Software documentation frequently becomes outdated or fails to exist entirely, yet developers need focused views of their codebase to understand complex systems. While automated reverse engineering tools can generate UML diagrams from code, they produce overwhelming detail without considering developer intent. We introduce query-driven UML diagram generation, where LLMs create diagrams that directly answer natural language questions about code. Unlike existing methods, our approach produces semantically focused diagrams containing only relevant elements with contextual descriptions. We fine-tune Qwen2.5-Coder-14B on a curated dataset of code files, developer queries, and corresponding diagram representations in a structured JSON format, evaluating with both automatic detection of structural defects and human assessment of semantic relevance. Results demonstrate that fine-tuning on a modest amount of manually corrected data yields dramatic improvements: our best model achieves the highest F1 scores while reducing defect rates below state-of-the-art LLMs, generating diagrams that are both structurally sound and semantically faithful to developer queries. Thus, we establish the feasibility of using LLMs for scalable contextual, on-demand documentation generation. We make our code and dataset publicly available at https://github.com/i-need-a-pencil/query2diagram.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Query2Diagram, a method for generating focused UML diagrams from natural language developer queries about codebases. It fine-tunes Qwen2.5-Coder-14B on a manually curated dataset of code files, queries, and structured JSON diagram representations, then evaluates using automatic structural defect detection and human semantic assessment. The central claim is that this yields dramatic improvements, with the best model achieving highest F1 scores, lower defect rates than state-of-the-art LLMs, and diagrams that are both structurally sound and semantically faithful; code and dataset are released publicly.

Significance. If the results hold under proper evaluation, the work demonstrates feasibility of scalable, query-driven documentation generation to address outdated or missing software docs. The public release of code and dataset is a clear strength, enabling reproducibility and extension by others.

major comments (2)
  1. [Abstract] The claims of 'dramatic improvements,' 'highest F1 scores,' and 'reducing defect rates below state-of-the-art LLMs' are unsupported by any quantitative results, dataset statistics (size, splits), exact metrics, baseline implementations, or significance tests. This absence is load-bearing for the empirical central claim.
  2. [Dataset construction and evaluation sections] No details are given on the number of training examples, query/codebase selection criteria, diversity of sources, or any distributional validation against real developer queries (e.g., from issue trackers). The generalization to 'semantically faithful' diagrams in deployment therefore rests on an unverified assumption about representativeness of the manually curated data.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'a modest amount of manually corrected data' is used without any indication of scale (e.g., example count), which would help readers assess the practicality of the fine-tuning approach.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and will revise the manuscript to incorporate additional details supporting the central claims.

read point-by-point responses
  1. Referee: [Abstract] The claims of 'dramatic improvements,' 'highest F1 scores,' and 'reducing defect rates below state-of-the-art LLMs' are unsupported by any quantitative results, dataset statistics (size, splits), exact metrics, baseline implementations, or significance tests. This absence is load-bearing for the empirical central claim.

    Authors: We agree that the abstract should be self-contained and include quantitative support. The current abstract uses qualitative phrasing without specific numbers. In revision we will add key results including the F1 scores, defect rates for our model versus baselines, dataset size and splits, and a brief mention of the evaluation approach. We will also report any statistical significance tests performed or note their omission as a limitation. revision: yes

  2. Referee: [Dataset construction and evaluation sections] No details are given on the number of training examples, query/codebase selection criteria, diversity of sources, or any distributional validation against real developer queries (e.g., from issue trackers). The generalization to 'semantically faithful' diagrams in deployment therefore rests on an unverified assumption about representativeness of the manually curated data.

    Authors: We acknowledge that the manuscript currently provides only high-level description of the curated dataset without exact counts or criteria. We will expand the relevant sections to specify the number of training examples, query and codebase selection process, sources used to promote diversity, and steps taken during manual curation. We will also add an explicit discussion of limitations, noting that formal distributional validation against issue-tracker queries was not performed and that generalization claims should be interpreted accordingly. revision: yes

Circularity Check

0 steps flagged

No circularity: standard fine-tuning and held-out evaluation

full rationale

The paper presents an empirical ML pipeline: fine-tune Qwen2.5-Coder-14B on a manually curated dataset of code-query-diagram triples, then evaluate with automatic structural defect detection and human semantic assessment. No equations, no fitted parameters renamed as predictions, no self-citation chains, and no uniqueness theorems are invoked. The central claims (highest F1, reduced defects, semantic faithfulness) rest on independent test-set performance rather than reducing to the training inputs by construction. This is the normal non-circular case for supervised learning papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that LLMs can reliably map natural-language queries to valid UML structures after modest fine-tuning, plus the practical assumption that a small manually corrected dataset suffices for generalization.

axioms (1)
  • domain assumption LLMs fine-tuned on structured JSON diagram representations will produce syntactically valid and semantically relevant UML diagrams for unseen queries
    Invoked in the abstract's description of fine-tuning and evaluation

pith-pipeline@v0.9.0 · 5547 in / 1120 out tokens · 53981 ms · 2026-05-08T05:58:16.148929+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

47 extracted references · 4 canonical work pages · 3 internal anchors

  1. [1]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, Sh. Anadkat, et al.,Gpt-4 technical report,https://arxiv. org/abs/2303.08774, 2023

  2. [2]

    Amalfitano, M

    D. Amalfitano, M. De Luca, T. Santilli, and P. Pelliccione,Automated software architec- ture design recovery from source code using llms, Software Architecture, Lecture Notes in Computer Science, vol. 15929, Springer, Cham, 2025, pp. 73–89

  3. [3]

    Arisholm, L

    E. Arisholm, L. C. Briand, S. E. Hove, and Y. Labiche,The impact of uml documentation on software maintenance: An experimental evaluation, IEEE Transactions on Software Engineering32(2006), no. 6, 365–381

  4. [4]

    Babaalla, A

    Z. Babaalla, A. Jakimi, and M. Oualla,Llm-driven mda pipeline for generating uml class diagrams and code, IEEE Access13(2025), 171266–171283

  5. [5]

    Battulga, L

    B. Battulga, L. Tsoodol, E. Dovdon, N. Bold, and O.-E. Namsrai,Metric-based defect prediction from class diagram, Array27(2025), 100438

  6. [6]

    M. Ben Chaaben,Software modeling assistance with large language models, Proceed- ings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems, Association for Computing Machinery, 2024, pp. 188–191

  7. [7]

    Boronat and J

    A. Boronat and J. Mustafa,Mdre-llm: A tool for analyzing and applying llms in soft- ware reverse engineering, Proceedings of the 2025 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE, 2025, pp. 850–854

  8. [8]

    C ´amara, J

    J. C ´amara, J. Troya, L. Burgue ˜no, and A. Vallecillo,On the assessment of generative ai in modeling tasks: an experience report with chatgpt and uml, Software and Systems Modeling22(2023), no. 3, 781–793

  9. [9]

    De Bari, G

    D. De Bari, G. Garaccione, R. Coppola, M. Torchiano, and L. Ardito,Evaluating large language models in exercises of uml class diagram modeling, Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (New York, NY, USA), ESEM ’24, Association for Computing Machinery, 2024, p. 393–399

  10. [10]

    DeepSeek-AI-Team,Deepseek-r1: Incentivizing reasoning capability in llms via rein- forcement learning,https://arxiv.org/abs/2501.12948, 2025. 17

  11. [11]

    Dettmers, A

    T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer,Qlora: Efficient finetuning of quantized llms, Advances in neural information processing systems36(2023), 10088–10115

  12. [12]

    Egyed,Automated abstraction of class diagrams, ACM Transactions on Software Engineering and Methodology (TOSEM)11(2002), no

    A. Egyed,Automated abstraction of class diagrams, ACM Transactions on Software Engineering and Methodology (TOSEM)11(2002), no. 4, 449–491

  13. [13]

    Elmers,Code and Comment Consistency Classification with Large Language Mod- els, Master’s thesis, Eindhoven University of Technology, Eindhoven, Netherlands, October 2023

    P. Elmers,Code and Comment Consistency Classification with Large Language Mod- els, Master’s thesis, Eindhoven University of Technology, Eindhoven, Netherlands, October 2023

  14. [14]

    Ferrari, S

    A. Ferrari, S. Abualhaijal, and Ch. Arora,Model generation with llms: From requirements to uml sequence diagrams, 2024 IEEE 32nd International Requirements Engineering Conference Workshops (REW), IEEE, 2024, pp. 291–300

  15. [15]

    Genero, M

    M. Genero, M. Piattini, and C. Calero,A survey of metrics for uml class diagrams, Journal of object technology4(2005), no. 9, 59–92

  16. [16]

    T. A. Ghaleb, M. A. Alturki, and Kh. Aljasser,Program comprehension through reverse- engineered sequence diagrams: A systematic review, Journal of Software: Evolution and Process30(2018), no. 11, e1965

  17. [17]

    Gu ´eh´eneuc,A reverse engineering tool for precise class diagrams, Proceedings of the 2004 conference of the Centre for Advanced Studies on Collaborative research, 2004, pp

    Y.-G. Gu ´eh´eneuc,A reverse engineering tool for precise class diagrams, Proceedings of the 2004 conference of the Centre for Advanced Studies on Collaborative research, 2004, pp. 28–41

  18. [18]

    Gu ´eh´eneuc, R

    Y.-G. Gu ´eh´eneuc, R. Douence, and N. Jussien,No Java without Caffeine: A tool for dynamic analysis of Java programs, Proceedings 17th IEEE International Conference on Automated Software Engineering„ IEEE, 2002, pp. 117–126

  19. [19]

    D. Han, M. Han, and Unsloth Team,Unsloth,http://github.com/unslothai/unsloth, 2023

  20. [20]

    Hebig, T

    R. Hebig, T. H. Quang, M. R. V. Chaudron, G. Robles, and M. A. Fernandez,The quest for open source projects that use uml: mining github, Proceedings of the ACM/IEEE 19th international conference on model driven engineering languages and systems, 2016, pp. 173–183

  21. [21]

    J. Hong, N. Lee, and J. Thorne,Orpo: Monolithic preference optimization without reference model, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 11170–11189

  22. [22]

    Hoque, P

    E. Hoque, P. Kavehzadeh, and A. Masry,Chart question answering: State of the art and future directions, Computer Graphics Forum41(2022), no. 3, 555–572

  23. [23]

    Jahan, M

    M. Jahan, M. M. Hassan, R. Golpayegani, G. Ranjbaran, Ch. Roy, B. Roy, and K. Schnei- der,Automated derivation of uml sequence diagrams from user stories: Unleashing the power of generative ai vs. a rule-based approach, Proceedings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems (New York, NY, USA), MODELS ’...

  24. [24]

    Jongeling, A

    R. Jongeling, A. Cicchetti, and F. Ciccozzi,How are informal diagrams used in software engineering? an exploratory study of open-source and industrial practices, Software and Systems Modeling24(2025), no. 3, 601–613. 18

  25. [25]

    V. Kan, M. P. Lnu, S. Berhe, C. El Kari, M. Maynard, and F. Khomh,Automated uml visualization of software ecosystems: Tracking versions, dependencies, and security updates, Procedia Computer Science, 8th International Conference on Emerging Data and Industry (EDI40), vol. 257, Elsevier, 2025, pp. 834–841

  26. [26]

    Khiati, Dj

    N. Khiati, Dj. Bouchiha, Y. Atig, and S. Boukli Hacene,Wa2ma: A model-driven approach for reengineering web applications into mobile applications, Edelweiss Applied Science and Technology9(2025), no. 6, 1530–1544

  27. [27]

    Kwon, Zh

    W. Kwon, Zh. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica,Efficient memory management for large language model serving with pagedattention, Proceedings of the 29th Symposium on Operating Systems Principles, 2023, pp. 611–626

  28. [28]

    Lomshakov and S

    V. Lomshakov and S. Nikolenko,Large language models for source code generation and editing, Zapiski Nauchnykh Seminarov POMI540(2024), 276–350

  29. [29]

    Nugroho and M

    A. Nugroho and M. R. V. Chaudron,A survey into the rigor of uml use and its perceived impact on quality and productivity, Proceedings of the Second ACM-IEEE international symposium on Empirical software engineering and measurement, 2008, pp. 90–99

  30. [30]

    ,Evaluating the impact of uml modeling on software quality: An industrial case study, ACM/IEEE International Conference on Model Driven Engineering Languages and Systems, 2009

  31. [31]

    ,The impact of uml modeling on defect density and defect resolution time in a proprietary system, Empirical Software Engineering19(2014), 926–954

  32. [32]

    Osman, A

    H. Osman, A. van Zadelhoff, and M. R. V. Chaudron,Uml class diagram simplification- a survey for improving reverse engineered class diagram comprehension, Interna- tional Conference on Model-Driven Engineering and Software Development, vol. 2, SCITEPRESS, 2013, pp. 291–296

  33. [33]

    Osman, A

    H. Osman, A. van Zadelhoff, Dave R Stikkolorum, and Michel RV Chaudron,Uml class diagram simplification: What is in the developer’s mind?, Proceedings of the second edition of the international workshop on experiences and empirical studies in software modelling, 2012, pp. 1–6

  34. [34]

    github.io/blog/qwq-32b-preview/, November 2024

    QwenTeam,Qwq: Reflect deeply on the boundaries of the unknown,https://qwenlm. github.io/blog/qwq-32b-preview/, November 2024

  35. [35]

    Shehata, B

    M. Shehata, B. Lepore, H. Cummings, and E. Parra,Creating uml class diagrams with general-purpose llms, 2024 IEEE Working Conference on Software Visualization (VISSOFT), IEEE, 2024, pp. 157–158

  36. [36]

    H. A. Siala and K. Lano,Leveraging llms for abstracting uml and ocl representations from java and python programs,https://papers.ssrn.com/sol3/papers.cfm?abstract_id= 5348203, 2025

  37. [37]

    ,Towards using llms in the reverse engineering of software systems to object con- straint language, Proceedings of the 2025 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE, 2025, pp. 885–890. 19

  38. [38]

    ,Using large language models to extract uml class diagrams from java programs, 8th International Conference on Software and System Engineering (ICoSSE 2025), IEEE, 2025, pp. 70–74

  39. [39]

    ˇStˇep´anek, D

    A. ˇStˇep´anek, D. Ku ˇt´ak, B. Kozl ´ıkov´a, and J. By ˇska,Helveg: Diagrams for software documentation, IEEE Transactions on Visualization and Computer Graphics31(2025), no. 10, 9079–9090

  40. [40]

    Sutton and J

    A. Sutton and J. I. Maletic,Recovering uml class models from c++: A detailed explanation, Information and Software Technology49(2007), no. 3, 212–229

  41. [41]

    Copilot Team,Microsoft Copilot: Your AI companion,https://copilot.microsoft.com/, 2023, Accessed: 2025-07-04

  42. [42]

    Cursor Team,Cursor — The AI Code Editor,https://cursor.com/, 2023, Accessed: 2025-07-04

  43. [43]

    Tonella and A

    P. Tonella and A. Potrich,Reverse engineering of the uml class diagram from c++ code in presence of weakly typed containers, Proceedings IEEE International Conference on Software Maintenance. ICSM 2001, IEEE, 2001, pp. 376–385

  44. [44]

    Unhelkar,Verification and validation for quality of uml 2.0 models, John Wiley & Sons, 2005

    Bh. Unhelkar,Verification and validation for quality of uml 2.0 models, John Wiley & Sons, 2005

  45. [45]

    B. T. Willard and R. Louf,Efficient guided generation for large language models,https: //arxiv.org/abs/2307.09702, 2023

  46. [46]

    A survey on large language models for software engineering,

    Q. Zhang, Ch. Fang, Y. Xie, Y. Zhang, Y. Yang, W. Sun, Sh. Yu, and Zh. Chen,A survey on large language models for software engineering,https://arxiv.org/abs/2312.15223, 2023

  47. [47]

    Zheng, R

    Y. Zheng, R. Zhang, J. Zhang, Y. Ye, Zh. Luo, Zh. Feng, and Y. Ma,Llamafactory: Unified efficient fine-tuning of 100+ language models, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) (Bangkok, Thailand), Association for Computational Linguistics, 2024. 20