pith. machine review for the scientific record.

arxiv: 2604.00851 · v1 · submitted 2026-04-01 · 💻 cs.SE

Recognition: no theorem link

Reliability of Large Language Models for Design Synthesis: An Empirical Study of Variance, Prompt Sensitivity, and Method Scaffolding

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 22:25 UTC · model grok-4.3

classification 💻 cs.SE
keywords: large language models · design synthesis · UML class diagrams · prompt engineering · software design · non-determinism · few-shot prompting · object-oriented principles

The pith

Preference-based prompting improves LLM adherence to design intent in UML diagrams but leaves substantial non-determinism intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether large language models can reliably perform design synthesis: generating, from natural language descriptions, UML class diagrams that follow object-oriented principles. It introduces a preference-based few-shot prompting method and tests it against standard and rule-injection prompting on three models, using two custom benchmarks with repeated runs. Results show the preference approach produces outputs that match intended design structures and patterns more closely. Yet even this method does not eliminate run-to-run variation, and differences between models turn out to be a stronger driver of overall reliability than the choice of prompting technique.

Core claim

Across 540 experiments, preference-based few-shot prompting biases model outputs toward designs that satisfy object-oriented principles and pattern-consistent structures more effectively than standard prompting or rule-injection prompting. This alignment improves adherence to design intent on the two benchmarks, but non-determinism persists in all three models and model-level behavior exerts the dominant influence on design reliability.
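
As a sanity check on the stated scale, a minimal sketch that enumerates the 2×3×10×3×3 design from the abstract; the factor names are illustrative, only the level counts come from the paper:

```python
from itertools import product

# Factor levels as stated in the abstract: 2 benchmarks x 3 paraphrased
# prompts x 10 repeated runs x 3 models x 3 prompting strategies.
benchmarks = ["benchmark_1", "benchmark_2"]
paraphrases = ["paraphrase_1", "paraphrase_2", "paraphrase_3"]
runs = range(10)
models = ["chatgpt-4o-mini", "claude-3.5-sonnet", "gemini-2.5-flash"]
strategies = ["standard", "rule_injection", "preference_based"]

experiments = list(product(benchmarks, paraphrases, runs, models, strategies))
assert len(experiments) == 540  # matches the paper's reported total
```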

What carries the argument

Preference-based few-shot prompting that biases outputs toward object-oriented principles and pattern-consistent structures.
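
A minimal sketch of how such a prompt might be assembled, based on the dataset structure the Figure 1 caption describes (a design description, two candidate solutions, and a human-preferred choice per example); the template and field names are illustrative, not the authors' exact implementation:

```python
# Illustrative preference-based few-shot prompt assembly. Each shot shows
# a design problem, two candidate class diagrams, and the human-preferred
# one, nudging the model toward the preferred (OO-principled) style.
def build_preference_prompt(examples, new_description):
    shots = []
    for ex in examples:
        shots.append(
            f"Design problem: {ex['description']}\n"
            f"Candidate A (LLM-generated):\n{ex['solution_llm']}\n"
            f"Candidate B (expert-designed):\n{ex['solution_expert']}\n"
            f"Preferred: {ex['preferred']}"
        )
    shots.append(
        f"Design problem: {new_description}\n"
        "Produce a UML class diagram in the style of the preferred "
        "solutions, following object-oriented principles and established "
        "design patterns."
    )
    return "\n---\n".join(shots)
```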

If this is right

  • Preference-based prompting can be applied to raise the quality of LLM-generated software designs relative to simpler methods.
  • Model choice must be evaluated separately because it affects reliability more than prompting strategy does.
  • Repeated sampling remains necessary to judge output stability even after preference alignment.
  • Standard prompting and rule-injection prompting deliver weaker adherence to design intent than the preference approach.
  • Achieving dependable LLM-assisted design requires attention to both prompting technique and underlying model robustness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same preference-alignment technique could be adapted for other software engineering tasks such as requirements-to-code translation.
  • Combining outputs from multiple models or adding post-generation checks might further reduce the variance that prompting alone leaves behind.
  • Design benchmarks for LLMs will need broader domain coverage to confirm whether the reliability patterns hold outside the two cases studied.
  • Teams using LLMs for design work should log variance across runs rather than treating any single output as definitive.
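
A minimal sketch of the variance logging the last point recommends; `generate_design` and `score_adherence` are hypothetical callables standing in for a team's model call and scoring rubric:

```python
import statistics

def sample_designs(generate_design, score_adherence, prompt, n_runs=10):
    """Query the model n_runs times and log output stability, rather
    than treating any single diagram as definitive."""
    outputs = [generate_design(prompt) for _ in range(n_runs)]
    scores = [score_adherence(o) for o in outputs]  # team-defined rubric
    return {
        "mean_adherence": statistics.mean(scores),
        "stdev_adherence": statistics.stdev(scores),  # run-to-run spread
        "n_distinct_outputs": len(set(outputs)),
        "outputs": outputs,
    }
```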

Load-bearing premise

The two custom design-intent benchmarks with three paraphrased prompts each adequately represent real-world design synthesis requirements.

What would settle it

Running the same three prompting methods on a new set of design problems drawn from additional domains and observing no gain in intent adherence or no reduction in variance would falsify the main result.

Figures

Figures reproduced from arXiv: 2604.00851 by Andreas Rausch, Rabia Iftikhar.

Figure 1. Proposed Approach. The surviving caption text describes the dataset: all five data points include a design-level description, two generated class-diagram solutions (solution 1 by an LLM, solution 2 by modeling experts), and a preferred choice labeled by a human annotator (the expert-generated model).
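
Read concretely, the caption implies a record like the following per data point (field names are ours, not the authors'; such records could feed the prompt-assembly sketch above):

```python
from dataclasses import dataclass

@dataclass
class PreferenceDataPoint:
    """One of the five dataset points described in the Figure 1 caption."""
    description: str      # design-level description of the problem
    solution_llm: str     # class diagram generated by an LLM
    solution_expert: str  # class diagram defined by modeling experts
    preferred: str        # annotator's choice (here, the expert solution)
```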
Original abstract

Large Language Models (LLMs) are increasingly applied to automate software engineering tasks, including the generation of UML class diagrams from natural language descriptions. While prior work demonstrates that LLMs can produce syntactically valid diagrams, syntactic correctness alone does not guarantee meaningful design. This study investigates whether LLMs can move beyond diagram translation to perform design synthesis, and how reliably they maintain design-oriented reasoning under variation. We introduce a preference-based few-shot prompting approach that biases LLM outputs toward designs satisfying object-oriented principles and pattern-consistent structures. Two design-intent benchmarks, each with three domain-only, paraphrased prompts and 10 repeated runs, are used to evaluate three LLMs (ChatGPT 4o-mini, Claude 3.5 Sonnet, Gemini 2.5 Flash) across three modeling strategies: standard prompting, rule-injection prompting, and preference-based prompting, totaling 540 experiments (i.e. 2×3×10×3×3). Results indicate that while preference-based alignment improves adherence to design intent, it does not eliminate non-determinism, and model-level behavior strongly influences design reliability. These findings highlight that achieving dependable LLM-assisted software design requires not only effective prompting but also careful consideration of model behavior and robustness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports an empirical study of three LLMs (ChatGPT 4o-mini, Claude 3.5 Sonnet, Gemini 2.5 Flash) performing UML class-diagram synthesis from natural-language descriptions. It introduces a preference-based few-shot prompting strategy and compares it to standard and rule-injection prompting on two custom design-intent benchmarks (each with three paraphrased prompts). Across 540 experiments (10 repetitions per prompt), the authors conclude that preference-based prompting improves adherence to object-oriented principles without eliminating non-determinism and that model-level behavior dominates reliability.

Significance. If the results are reproducible, the work supplies concrete evidence on the limits of current prompting techniques for design synthesis tasks and underscores the need to account for model-specific variance in LLM-assisted software engineering. The explicit experimental scale (540 runs) and focus on repeated sampling are positive features that allow direct inspection of output stability.

major comments (2)
  1. [Abstract and §3 (Benchmarks)] The central claim that preference-based prompting improves design reliability rests on the assumption that the two custom benchmarks (six prompts total) are representative of real-world design synthesis. No validation against external design corpora, no inter-rater agreement statistics for adherence judgments, and no justification for the narrow task distribution are provided; this directly weakens the generalizability of the headline result.
  2. [§4 (Experimental Setup)] The manuscript states that 540 experiments were performed but supplies no statistical protocol for quantifying non-determinism or for testing differences across prompting strategies. The absence of variance measures (e.g., standard deviation per prompt), confidence intervals, or significance tests leaves the reported improvements and model effects unsupported by formal analysis.
minor comments (2)
  1. Add a table that explicitly breaks down the 2×3×10×3×3 design so readers can verify the total of 540 runs.
  2. Define the exact scoring rubric used for 'adherence to design intent' and state whether it was applied by the authors or by independent raters.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and indicate the revisions we will make to improve the manuscript.

Point-by-point responses
  1. Referee: [Abstract and §3 (Benchmarks)] The central claim that preference-based prompting improves design reliability rests on the assumption that the two custom benchmarks (six prompts total) are representative of real-world design synthesis. No validation against external design corpora, no inter-rater agreement statistics for adherence judgments, and no justification for the narrow task distribution are provided; this directly weakens the generalizability of the headline result.

    Authors: We deliberately constructed the two custom benchmarks to isolate specific design intents while controlling for prompt paraphrasing, enabling direct measurement of output variance under repeated sampling. This controlled setup was chosen to focus on the core research questions rather than broad coverage. We acknowledge the absence of external corpus validation and formal inter-rater statistics. In the revised manuscript we will expand §3 with explicit justification for domain and task selection, add a limitations subsection on generalizability, and provide a detailed description of the author-defined adherence rubric used for judgments. A multi-rater agreement study was not conducted and cannot be added retrospectively. revision: partial

  2. Referee: [§4 (Experimental Setup)] The manuscript states that 540 experiments were performed but supplies no statistical protocol for quantifying non-determinism or for testing differences across prompting strategies. The absence of variance measures (e.g., standard deviation per prompt), confidence intervals, or significance tests leaves the reported improvements and model effects unsupported by formal analysis.

    Authors: We agree that formal statistical support is needed. We will revise §4 to describe the full statistical protocol, report standard deviations and variances of adherence scores across the ten repetitions per prompt, include confidence intervals, and add significance testing (e.g., ANOVA for model and strategy effects, with post-hoc comparisons). These quantitative results will be integrated into the results section to substantiate the reported differences. revision: yes
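
For illustration only, a minimal sketch of the kind of protocol promised here, shown for one prompt with placeholder scores (real values would come from the rubric-scored runs); a two-way analysis over model and strategy effects would extend the same pattern:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholder adherence scores: 10 repetitions per strategy for one prompt.
scores = {
    "standard":         rng.uniform(0.4, 0.7, size=10),
    "rule_injection":   rng.uniform(0.5, 0.8, size=10),
    "preference_based": rng.uniform(0.6, 0.9, size=10),
}

# Per-strategy dispersion: the non-determinism the referee asks to quantify.
for name, s in scores.items():
    mean, sd = s.mean(), s.std(ddof=1)
    ci = stats.t.interval(0.95, df=len(s) - 1,
                          loc=mean, scale=sd / np.sqrt(len(s)))
    print(f"{name}: mean={mean:.2f}, sd={sd:.2f}, "
          f"95% CI=({ci[0]:.2f}, {ci[1]:.2f})")

# One-way ANOVA for a strategy effect; post-hoc pairwise comparisons
# (e.g., Tukey HSD) would identify which strategies differ.
f_stat, p_value = stats.f_oneway(*scores.values())
print(f"ANOVA across strategies: F={f_stat:.2f}, p={p_value:.3f}")
```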

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with direct output comparisons

Full rationale

The paper conducts an empirical study by running 540 LLM experiments on two custom design-intent benchmarks (each with three paraphrased prompts) and comparing outputs across prompting strategies. No equations, derivations, fitted parameters, or first-principles claims exist that could reduce to inputs by construction. Results derive from direct measurement of adherence and variance rather than any self-referential definitions or self-citation chains. The central claims rest on the experimental data itself, which is externally falsifiable via replication on the same benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the two introduced design-intent benchmarks and the assumption that preference-based prompting can measurably bias outputs toward OO principles without introducing new artifacts.

axioms (1)
  • domain assumption: The custom benchmarks accurately reflect meaningful design synthesis tasks.
    Invoked when interpreting adherence improvements as evidence of design synthesis capability.

pith-pipeline@v0.9.0 · 5525 in / 1200 out tokens · 40816 ms · 2026-05-13T22:25:02.265333+00:00 · methodology

