arxiv: 2605.26898 · v1 · pith:DD2AC3JP · submitted 2026-05-26 · 💻 cs.SE · cs.AI

Strategies for Guiding LLMs to Use Software Design Patterns: A Case of Singleton

Pith reviewed 2026-06-29 15:42 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LLM code generation · design patterns · Singleton pattern · prompt engineering · automated feedback · software architecture · HumanEval-X · Java code

The pith

Iterative binary feedback best aligns LLM code generation with the Singleton design pattern while preserving functionality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests prompting strategies to make LLMs generate code that follows the Singleton design pattern rather than just producing functional but unstructured solutions. Across 13 models and 164 Java tasks from HumanEval-X, the authors compare instructions alone, binary automated feedback, extensive feedback, and few-shot variants. Results show that the best strategy is model-dependent, yet binary feedback overall yields the highest Singleton adherence rates while maintaining or raising the number of passing tests. This addresses a practical gap as LLMs are deployed for software engineering tasks where architectural consistency matters for maintainability.

Core claim

The central finding is that iterative binary automated feedback provides the best overall alignment with the Singleton pattern across models while preserving or improving code functionality. The best strategy still varies by model: with instructions alone, Llama 3.3 reached 100% Singleton adherence and a 34.1 percentage point gain in tests passed, while binary feedback brought Qwen 3 (8B) to 99.2% alignment and 58.6% functionality.

What carries the argument

Four prompting strategies (plain instructions, binary automated feedback, extensive automated feedback, and extensive feedback with few-shot examples) applied iteratively to detect and enforce Singleton adherence in generated Java code.
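The binary-feedback strategy can be sketched as a small retry loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `BinaryFeedbackLoop`, the prompt wording, the `maxAttempts` cap, and the way the previous answer is fed back are all hypothetical, and the model is a stub passed in as a plain function.

```java
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical sketch of an iterative binary-feedback loop (not the authors' code). */
final class BinaryFeedbackLoop {

    /** Final code, whether the checker accepted it, and how many attempts were used. */
    static final class Outcome {
        final String code;
        final boolean accepted;
        final int attempts;

        Outcome(String code, boolean accepted, int attempts) {
            this.code = code;
            this.accepted = accepted;
            this.attempts = attempts;
        }
    }

    // Assumed prompt wording; the paper's actual prompts are not quoted in this summary.
    static final String INSTRUCTION =
            "Solve the task in Java and implement the solution as a Singleton class.";
    static final String BINARY_FEEDBACK =
            "The previous answer does not follow the Singleton pattern. Please try again.";

    static Outcome run(String task,
                       Function<String, String> model,
                       Predicate<String> followsSingleton,
                       int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        String basePrompt = INSTRUCTION + "\n\n" + task;
        String prompt = basePrompt;
        String code = "";
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            code = model.apply(prompt);
            if (followsSingleton.test(code)) {
                return new Outcome(code, true, attempt);
            }
            // Binary feedback: a bare pass/fail signal, with no hint about what is missing.
            prompt = basePrompt + "\n\nPrevious answer:\n" + code + "\n\n" + BINARY_FEEDBACK;
        }
        return new Outcome(code, false, maxAttempts);
    }

    public static void main(String[] args) {
        // Stub standing in for an LLM: fails once, then returns a Singleton-shaped class.
        int[] calls = {0};
        Function<String, String> stubModel = prompt -> calls[0]++ == 0
                ? "class Solution { }"
                : "class Solution { private Solution() {} "
                  + "public static Solution getInstance() { return null; } }";
        // Toy checker for the demo only; a real detector would parse the code.
        Predicate<String> toyChecker = code -> code.contains("getInstance");

        Outcome outcome = run("Return the sum of two integers.", stubModel, toyChecker, 3);
        System.out.println("accepted=" + outcome.accepted + " attempts=" + outcome.attempts);
    }
}
```

The extensive-feedback variants would differ only in the feedback string, for example by naming which Singleton requirement was not met; that detail is likewise an assumption here.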

If this is right

  • Simple instructions alone can drive certain models like Llama 3.3 to 100% Singleton usage while also raising functional correctness.
  • Binary feedback scales well for models like Qwen 3 (8B), reaching near-perfect pattern adherence without sacrificing test pass rates.
  • Strategy choice must be tuned per model rather than applied uniformly.
  • Even lightweight feedback loops suffice to steer LLMs toward established design patterns in code generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may transfer to other creational or structural patterns if similar automated detectors can be built.
  • Combining binary feedback with larger context windows could improve results on more complex multi-class designs.
  • These techniques might reduce the need for post-generation refactoring in LLM-assisted development pipelines.

Load-bearing premise

The automated checker used to score whether generated code follows the Singleton pattern does so accurately and without systematic bias.

What would settle it

Running the same 164 tasks with an independent human review or alternative detector that finds substantially lower Singleton adherence under binary feedback than reported would falsify the effectiveness claim.
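Such a comparison between detector verdicts and independent human labels comes down to a confusion matrix. The sketch below shows one way to compute precision, recall, and raw agreement; the class name `DetectorAgreement` is hypothetical and the five labels in `main` are invented toy values, not data from the paper.

```java
/** Hypothetical helper comparing detector verdicts with human labels (toy sketch). */
final class DetectorAgreement {
    final int truePositives;
    final int falsePositives;
    final int falseNegatives;
    final int trueNegatives;

    private DetectorAgreement(int tp, int fp, int fn, int tn) {
        this.truePositives = tp;
        this.falsePositives = fp;
        this.falseNegatives = fn;
        this.trueNegatives = tn;
    }

    /** Each index is one generated solution; true means "judged to be a Singleton". */
    static DetectorAgreement compare(boolean[] detector, boolean[] human) {
        if (detector.length != human.length) {
            throw new IllegalArgumentException("label arrays must have the same length");
        }
        int tp = 0, fp = 0, fn = 0, tn = 0;
        for (int i = 0; i < detector.length; i++) {
            if (detector[i] && human[i]) tp++;
            else if (detector[i]) fp++;
            else if (human[i]) fn++;
            else tn++;
        }
        return new DetectorAgreement(tp, fp, fn, tn);
    }

    /** Share of detector-flagged Singletons that the human also accepted. */
    double precision() {
        int flagged = truePositives + falsePositives;
        return flagged == 0 ? Double.NaN : (double) truePositives / flagged;
    }

    /** Share of human-accepted Singletons that the detector also flagged. */
    double recall() {
        int actual = truePositives + falseNegatives;
        return actual == 0 ? Double.NaN : (double) truePositives / actual;
    }

    /** Share of samples on which detector and human agree. */
    double agreement() {
        int total = truePositives + falsePositives + falseNegatives + trueNegatives;
        return total == 0 ? Double.NaN : (double) (truePositives + trueNegatives) / total;
    }

    public static void main(String[] args) {
        // Invented toy labels, for illustration only.
        boolean[] detector = {true, true, true, false, false};
        boolean[] human = {true, true, false, true, false};
        DetectorAgreement result = compare(detector, human);
        System.out.printf("precision=%.3f recall=%.3f agreement=%.3f%n",
                result.precision(), result.recall(), result.agreement());
    }
}
```

Low precision in particular would mean that the reported adherence rates overstate how often the generated code is really a Singleton.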

Figures

Figures reproduced from arXiv: 2605.26898 by Farnaz Fotrousi, Miroslaw Staron, Viktor Kjellberg.

Figure 1: A class representing the software of an engine. To the left, without implementing Singleton. On the right, the engine ...
Figure 2: The design and process of the four experiments. In each experiment design the input is shown in the white boxes ...
Figure 3: On the left side of each sub-figure: the number of fulfilled predicates per model. On the right side of each sub-figure: ...
Figure 4: The rate of passed tests for all models in all experiments. Green represents passed tests, yellow represents generated ...
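The contrast that the Figure 1 caption describes can be sketched in a few lines of Java. The figure itself is not reproduced on this page, so the class names `PlainEngine` and `SingletonEngine`, their methods, and the choice of eager initialization are assumptions made for illustration.

```java
/** Demo entry point for the two hypothetical engine classes below. */
final class EngineDemo {
    public static void main(String[] args) {
        // Two plain engines are independent objects with independent state.
        PlainEngine first = new PlainEngine();
        PlainEngine second = new PlainEngine();
        first.start();
        System.out.println("plain engines share state: " + second.isRunning());

        // Every caller of getInstance() receives the same object.
        SingletonEngine.getInstance().start();
        System.out.println("singleton engine running: "
                + SingletonEngine.getInstance().isRunning());
    }
}

/** Without Singleton: any caller can construct its own engine. */
final class PlainEngine {
    private boolean running;

    void start() { running = true; }

    boolean isRunning() { return running; }
}

/** With Singleton: a private constructor and a static accessor to the single instance. */
final class SingletonEngine {
    // Eager initialization; the JVM runs the static initializer once per class load.
    private static final SingletonEngine INSTANCE = new SingletonEngine();

    private boolean running;

    private SingletonEngine() { }

    public static SingletonEngine getInstance() { return INSTANCE; }

    void start() { running = true; }

    boolean isRunning() { return running; }
}
```

Lazy or holder-based initialization are common alternatives; which variant the paper's figure uses is not stated here.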
Large Language Models (LLMs) can generate functional source code from natural-language prompts, but often fail to consistently follow higher-level architectural structures or design patterns. Since LLMs are increasingly used in software engineering, their ability to apply established design principles to generated code is crucial to the long-term success of software products. Therefore, the goal of this paper is to identify strategies for guiding LLMs to incorporate design patterns into the generated source code. We designed a computational experiment to evaluate the ability of 13 LLMs to generate code that follows the Singleton design pattern, using four prompting strategies: instructions, binary automated feedback, extensive automated feedback, and extensive feedback with few-shot prompts, in 164 Java coding challenges from HumanEval-X. Our results show that the optimal strategy to guide LLMs to include design patterns depends heavily on the type of model. Still, overall, iterative binary feedback provides the best alignment with Singleton while preserving or improving the code's functionality. When guided with instructions, Llama 3.3 generated Singleton classes in 100% of cases and improved code functionality, increasing the number of tests passed by 34.1 percentage points. It achieved a similar result with guidance through instructions and binary feedback. Qwen 3 (8B) increased the alignment with Singleton to 99.2% and the functionality to 58.6% using binary feedback. Our results suggest that even simple strategies can be used to guide LLMs to use design patterns.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript reports an empirical study evaluating the effectiveness of four prompting strategies—instructions, binary automated feedback, extensive automated feedback, and extensive feedback with few-shot prompts—in guiding 13 large language models to generate Java code adhering to the Singleton design pattern across 164 tasks from the HumanEval-X benchmark. The key finding is that iterative binary feedback generally provides the best balance of Singleton alignment and preserved or improved functionality, with examples including Llama 3.3 achieving 100% Singleton generation and a 34.1 percentage point increase in tests passed using instructions, and Qwen 3 (8B) reaching 99.2% alignment and 58.6% functionality with binary feedback.

Significance. If the automated detection of Singleton adherence proves reliable, the work supplies concrete, model-specific guidance on prompting strategies for enforcing design patterns in LLM-generated code, a practically relevant contribution to software engineering given the growing use of LLMs for code synthesis. The scale (13 models, 164 tasks) and use of a public benchmark are strengths that support reproducibility and comparability.

major comments (1)
  1. [Abstract] All reported metrics (e.g., 100% Singleton for Llama 3.3, 99.2% for Qwen 3, functionality deltas) depend on an unspecified automated detector for Singleton adherence in generated Java code. No description of the detection rules (static analysis, AST patterns, regex on private constructor + getInstance, or LLM judge), precision/recall, or human validation is provided. This is load-bearing for the central claim that binary feedback improves alignment while preserving functionality.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for identifying the need for greater transparency around the automated detector, which is indeed central to our claims. We address the comment below and will make the requested changes in the revised manuscript.

  1. Referee: [Abstract] All reported metrics (e.g., 100% Singleton for Llama 3.3, 99.2% for Qwen 3, functionality deltas) depend on an unspecified automated detector for Singleton adherence in generated Java code. No description of the detection rules (static analysis, AST patterns, regex on private constructor + getInstance, or LLM judge), precision/recall, or human validation is provided. This is load-bearing for the central claim that binary feedback improves alignment while preserving functionality.

    Authors: We agree that the description of the automated detector is insufficient in the current manuscript and that this detail is load-bearing. The detector is implemented via static analysis (JavaParser AST traversal to confirm a private constructor and a public static getInstance() method returning the instance), but the manuscript provides only a high-level reference. In the revision we will add a dedicated subsection in Methods that specifies the exact rules, the library used, edge-case handling, and results of human validation on a random sample of 50 outputs (reporting precision, recall, and agreement rate). revision: yes
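The two rules named in the rebuttal (a private constructor and a public static `getInstance()` returning the instance) can be illustrated without an external library. The sketch below swaps the JavaParser AST traversal for `java.lang.reflect` on already compiled classes, so it is a stand-in for the idea rather than the authors' detector; the class `SingletonShapeCheck`, the two sample classes, and the exact rule details (no-argument accessor, return type equal to the class) are assumptions.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

/** Reflection-based stand-in for the two rules named in the rebuttal (simplified). */
final class SingletonShapeCheck {

    /** Rule 1: the class declares at least one constructor, and all are private. */
    static boolean hasOnlyPrivateConstructors(Class<?> type) {
        Constructor<?>[] constructors = type.getDeclaredConstructors();
        if (constructors.length == 0) {
            return false; // e.g. interfaces have no constructors
        }
        for (Constructor<?> constructor : constructors) {
            if (!Modifier.isPrivate(constructor.getModifiers())) {
                return false;
            }
        }
        return true;
    }

    /** Rule 2: a public static, no-argument getInstance() whose return type is the class. */
    static boolean hasStaticAccessor(Class<?> type) {
        for (Method method : type.getDeclaredMethods()) {
            int modifiers = method.getModifiers();
            if (method.getName().equals("getInstance")
                    && Modifier.isPublic(modifiers)
                    && Modifier.isStatic(modifiers)
                    && method.getParameterCount() == 0
                    && method.getReturnType() == type) {
                return true;
            }
        }
        return false;
    }

    static boolean looksLikeSingleton(Class<?> type) {
        return hasOnlyPrivateConstructors(type) && hasStaticAccessor(type);
    }

    public static void main(String[] args) {
        System.out.println("ConfigRegistry: " + looksLikeSingleton(ConfigRegistry.class));
        System.out.println("PlainCounter: " + looksLikeSingleton(PlainCounter.class));
    }
}

/** Hypothetical sample that satisfies both rules. */
final class ConfigRegistry {
    private static final ConfigRegistry INSTANCE = new ConfigRegistry();

    private ConfigRegistry() { }

    public static ConfigRegistry getInstance() { return INSTANCE; }
}

/** Hypothetical sample with an implicit, non-private default constructor. */
final class PlainCounter {
    int value;

    void increment() { value++; }
}
```

A shape check like this accepts any class with the right signatures, including one whose `getInstance()` returns a fresh object on every call, which is one reason the promised human validation matters.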

Circularity Check

0 steps flagged

No circularity: purely empirical measurement on public benchmark

The paper conducts a computational experiment applying four prompting strategies to 13 LLMs across 164 HumanEval-X Java tasks and reports measured percentages of Singleton adherence and test-pass rates. No equations, fitted parameters, derivations, or predictions appear. No self-citations are invoked to justify core claims. The results stand or fall on the reported experimental outcomes against the external benchmark; the measurement procedure itself is not shown to reduce to a self-definition or prior self-citation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper is an empirical evaluation study. It introduces no free parameters, invented entities, or non-standard axioms beyond the background assumption that LLMs generate code and that design patterns are desirable.

axioms (2)
  • domain assumption LLMs can generate functional source code from natural-language prompts
    Stated as given in the opening sentence of the abstract.
  • domain assumption Following established design patterns improves long-term software quality
    Implicit premise motivating the entire study.

pith-pipeline@v0.9.1-grok · 5802 in / 1406 out tokens · 56494 ms · 2026-06-29T15:42:03.096470+00:00 · methodology
