pith. machine review for the scientific record.

arxiv: 2605.01392 · v1 · submitted 2026-05-02 · 💻 cs.SE · cs.AI


Using LLMs in Software Design: An Empirical Study of GitHub and A Practitioner Survey


Pith reviewed 2026-05-09 14:36 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LLMs · software design · ChatGPT · empirical study · GitHub mining · practitioner survey · design tasks · benefits and limitations

The pith

An empirical study combining 291 developer-ChatGPT conversations shared on GitHub with a survey of 65 practitioners maps nine categories of software design tasks where LLMs are used, along with interaction patterns, seven benefits, and six limitations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines real-world use of large language models in software design, an area long viewed as requiring deep human judgment. It combines analysis of public developer-ChatGPT exchanges on GitHub with direct input from practicing engineers to identify the kinds of design work these models support and how people actually work with them. The results show that usage concentrates on knowledge gathering and code generation at the detailed design stage, delivering gains such as improved technology selection and earlier flaw detection while also producing verbose, sometimes incorrect, or hallucinated outputs. These observations matter because they replace speculation with evidence about where and how LLMs are already entering a core engineering activity.

Core claim

Mining 291 developer-ChatGPT conversations shared on GitHub and surveying 65 software practitioners reveals nine distinct categories of design tasks supported by the model, such as architecture design, data model design, and application of design patterns. Interactions center on knowledge acquisition and design-related code generation, with the majority of tasks occurring at the detailed design level. Developers report seven benefits including better technology selection and early detection of design flaws, alongside six limitations such as overly lengthy or difficult-to-read outputs, generation of inexecutable or incorrect code, and hallucinations arising from heavy context dependence.

What carries the argument

The mixed-methods empirical analysis that extracts task categories, interaction modes, benefits, and limitations directly from GitHub conversation logs and practitioner survey responses.

If this is right

  • Design tools can be specialized to assist at the detailed level where most current LLM use occurs.
  • Practitioners gain concrete guidance on prompting strategies that maximize benefits like technology selection while reducing risks of incorrect code.
  • Integration efforts should prioritize mechanisms for context management to limit hallucinations.
  • The identified task categories provide a starting taxonomy for evaluating future LLM design assistants.
  • Training programs for developers can incorporate the observed interaction patterns to improve effective use of these models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The tension between benefits and limitations implies that effective LLM integration in design will require hybrid human-AI workflows rather than full automation.
  • Findings on context dependence suggest that pairing LLMs with project-specific knowledge bases could reduce hallucinations more reliably than general prompting alone.
  • The concentration at detailed design level raises the question of whether LLMs could scale to earlier architectural stages if supplied with higher-level constraints.

Load-bearing premise

The collected GitHub conversations and survey responses form a representative sample of how software developers in general use LLMs for design work.

What would settle it

A larger, independently collected dataset of LLM-assisted design sessions from closed-source projects or enterprise tools that shows markedly different task distributions or interaction patterns, or net-negative rather than mixed outcomes, would undermine the reported categories and the balance of benefits versus limitations.

Figures

Figures reproduced from arXiv: 2605.01392 by Arif Ali Khan, Mojtaba Shahin, Peng Liang, Qiong Feng, Ruiyin Li, Yangxiao Cai, Yifei Wang, Zengyang Li.

Figure 1. Overview of the mixed-methods research process
Figure 2. Welcome page of the survey questionnaire
Figure 3. Distributions of software development experience and roles of the participants
Figure 4. Counts of the nine software design task categories
Figure 5. Distribution of dialogue rounds per design-related task
Figure 6. Distributions of developers’ purposes for initiating a conversation and software design task levels
Figure 7. Distribution of the seven reported benefits of using LLMs in software design
Figure 8. Distribution of the six reported limitations of using LLMs in design
Original abstract

Recent advancements in Large Language Models (LLMs) have demonstrated significant potential across a wide range of software engineering tasks, including software design, an area traditionally regarded as highly dependent on human expertise and judgment. However, there has been little research focusing on how LLMs are used in software design, nor on the associated benefits and drawbacks. This paper aims to bridge this gap by empirically investigating how software developers utilize LLMs in the context of software design. We conduct a mixed-methods study, combining a mining study of 291 developer-ChatGPT conversations shared on GitHub with a survey of 65 software practitioners. Our findings reveal nine distinct categories of design tasks supported by ChatGPT, including architecture design, data model design, and the use of design patterns. We further characterize developer-ChatGPT interactions, showing that developers primarily use ChatGPT for knowledge acquisition and design-related code generation, with most tasks situated at the detailed design level. The study identifies seven key benefits of utilizing LLMs in software design as perceived by developers, such as better technology selection and the early detection of design flaws. We also uncover six limitations, including the generation of overly lengthy and difficult-to-read outputs, the creation of inexecutable or incorrect code, and a heavy reliance on context that can lead to hallucinated results. These findings provide an evidence-based characterization of current LLM use in software design from both open-source and practitioner perspectives, highlighting a tension between perceived benefits and limitations, which lays a foundation for future research and the development of effective techniques and tools to integrate LLMs into software design practices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

4 major / 2 minor

Summary. The paper presents a mixed-methods empirical study that combines mining of 291 developer-ChatGPT conversations shared on GitHub with a survey of 65 software practitioners. It identifies nine categories of design tasks supported by LLMs (e.g., architecture design, data model design, design patterns), characterizes interactions as primarily focused on knowledge acquisition and design-related code generation at the detailed-design level, and reports seven perceived benefits (e.g., improved technology selection, early flaw detection) alongside six limitations (e.g., overly lengthy outputs, inexecutable code, context-dependent hallucinations).

Significance. If the samples are shown to be sufficiently representative, the work supplies a timely, evidence-based snapshot of LLM adoption in a traditionally human-centric SE activity. The mixed-methods design linking public conversation logs to practitioner perceptions is a positive feature, and the explicit enumeration of benefits versus limitations could usefully inform both tool design and follow-on studies. The absence of parameter fitting or circular derivations is also a methodological strength.

major comments (4)
  1. [§3.1] §3.1 (GitHub mining): The paper does not specify the exact search queries, inclusion/exclusion criteria, or temporal window used to collect the 291 conversations, nor does it report how many candidate threads were discarded as non-design-related; without these details the claim that the sample captures 'real-world' use cannot be evaluated.
  2. [§3.2] §3.2 (Survey): The recruitment method, response rate, and demographic profile of the 65 respondents are not reported, leaving open the possibility of convenience or self-selection bias that could systematically over-represent developers already favorably disposed toward LLMs.
  3. [§4] §4 (Qualitative analysis): No information is given on the coding scheme development, number of coders, inter-rater reliability statistic, or disagreement-resolution process used to derive the nine task categories, seven benefits, and six limitations; standard practice in empirical SE requires these metrics to support reliability of the thematic findings.
  4. [§6] §6 (Threats to validity): The discussion acknowledges self-selection but does not quantify or bound its likely impact on the reported distributions (e.g., over-representation of successful interactions), which is load-bearing for the general statements about 'how software developers utilize LLMs'.
minor comments (2)
  1. [Abstract / §1] The abstract and §1 could more explicitly state the final sample sizes (291 conversations, 65 respondents) rather than burying them in the methods section.
  2. [Figure 1] Figure 1 (interaction characterization) would benefit from an accompanying table that reports exact percentages or counts for each interaction type to improve readability.
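Major comment 3 asks for an inter-rater reliability statistic; in empirical SE this is most often Cohen's kappa over two coders' labels. A minimal sketch of the computation, using hypothetical category labels rather than the paper's actual coding data:

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labeling the same items."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement if each coder's label marginals were independent.
    ca, cb = Counter(coder_a), Counter(coder_b)
    p_e = sum(ca[label] * cb[label] for label in ca.keys() | cb.keys()) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codings of eight conversation excerpts into task categories.
a = ["arch", "data", "arch", "pattern", "data", "arch", "data", "pattern"]
b = ["arch", "data", "pattern", "pattern", "data", "arch", "data", "arch"]
print(round(cohens_kappa(a, b), 3))  # → 0.619
```

Kappa discounts chance agreement, which is why it is preferred over raw percent agreement when category frequencies are skewed, as they plausibly are across nine unevenly populated design-task categories.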

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive comments on our paper. We address each of the major comments below and indicate the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [§3.1] §3.1 (GitHub mining): The paper does not specify the exact search queries, inclusion/exclusion criteria, or temporal window used to collect the 291 conversations, nor does it report how many candidate threads were discarded as non-design-related; without these details the claim that the sample captures 'real-world' use cannot be evaluated.

    Authors: We will update Section 3.1 to include the specific search queries used for mining the GitHub conversations, the inclusion and exclusion criteria applied to identify design-related threads, the temporal window of data collection, and the number of candidate threads that were reviewed and discarded. These details will strengthen the transparency of our data collection process and allow for better evaluation of the sample's representativeness. revision: yes

  2. Referee: [§3.2] §3.2 (Survey): The recruitment method, response rate, and demographic profile of the 65 respondents are not reported, leaving open the possibility of convenience or self-selection bias that could systematically over-represent developers already favorably disposed toward LLMs.

    Authors: We will revise Section 3.2 to report the survey recruitment methods, including the platforms and channels used to reach participants, the response rate based on the number of completed surveys out of those invited or accessed, and a demographic breakdown of the 65 respondents. This will help readers assess potential biases such as self-selection. revision: yes

  3. Referee: [§4] §4 (Qualitative analysis): No information is given on the coding scheme development, number of coders, inter-rater reliability statistic, or disagreement-resolution process used to derive the nine task categories, seven benefits, and six limitations; standard practice in empirical SE requires these metrics to support reliability of the thematic findings.

    Authors: We will expand Section 4 to describe the qualitative coding process in detail, including how the coding scheme was developed, the number of coders involved, any inter-rater reliability statistics computed, and how disagreements were resolved. If formal reliability metrics were not calculated, we will explain the collaborative approach taken to ensure consistency in deriving the categories, benefits, and limitations. revision: partial

  4. Referee: [§6] §6 (Threats to validity): The discussion acknowledges self-selection but does not quantify or bound its likely impact on the reported distributions (e.g., over-representation of successful interactions), which is load-bearing for the general statements about 'how software developers utilize LLMs'.

    Authors: We will enhance the threats to validity section (Section 6) by providing a more thorough discussion of self-selection bias, including potential impacts on the distributions of tasks, benefits, and limitations (such as over-representation of successful interactions in public GitHub shares). We will bound these effects where possible and clarify the scope of our general statements about LLM utilization in software design. revision: yes
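The transparency the rebuttal promises for §3.1 amounts to a reproducible filter over candidate threads. As an illustration only, mining of this kind typically starts by extracting shared-conversation links from thread bodies; the URL pattern below is an assumption about the general technique, not the authors' published query or criteria:

```python
import re

# Shared ChatGPT conversations use a stable URL shape ending in a UUID.
# This pattern is an illustrative assumption, not the paper's actual query.
SHARE_LINK = re.compile(r"https://chat\.openai\.com/share/[0-9a-f-]{36}")

def extract_share_links(text):
    """Return the unique shared-conversation links found in a thread body."""
    return sorted(set(SHARE_LINK.findall(text)))

thread = (
    "See my design discussion here: "
    "https://chat.openai.com/share/123e4567-e89b-12d3-a456-426614174000 "
    "and an unrelated link https://example.com/foo"
)
print(extract_share_links(thread))
# → ['https://chat.openai.com/share/123e4567-e89b-12d3-a456-426614174000']
```

Publishing the equivalent of this filter, plus the inclusion/exclusion pass that follows it, is what would let readers rerun the mining step and audit the 291-conversation sample.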

Circularity Check

0 steps flagged

No circularity: empirical claims rest on direct analysis of external GitHub conversations and survey responses

Full rationale

The paper conducts a mixed-methods empirical study by collecting and analyzing 291 developer-ChatGPT conversations from GitHub plus 65 practitioner survey responses. All reported findings—nine task categories, interaction patterns, seven benefits, and six limitations—are obtained through standard qualitative coding and quantitative summarization of this primary external data. No mathematical derivations, equations, fitted parameters presented as predictions, or self-citation chains appear in the derivation of the central claims. The analysis is self-contained against the collected artifacts; representativeness concerns affect external validity but do not create internal circularity by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical characterization study; the central claims rest on the representativeness of the sampled conversations and survey responses rather than on any mathematical derivation or new postulated entities.

axioms (1)
  • domain assumption The 291 GitHub conversations and 65 survey responses constitute a representative sample of developer-LLM interactions in software design.
    Generalization from the collected data to broader practice depends on this assumption.

pith-pipeline@v0.9.0 · 5611 in / 1347 out tokens · 44810 ms · 2026-05-09T14:36:01.427042+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

45 extracted references · 7 canonical work pages · 4 internal anchors

  1. [1]

Shrikara Arun, Meghana Tedla, and Karthik Vaidhyanathan. 2025. LLMs for Generation of Architectural Components: An Exploratory Empirical Study in the Serverless World. In Proceedings of the 22nd IEEE International Conference on Software Architecture (ICSA). IEEE, 25–36

  2. [2]

Shraddha Barke, Michael B. James, and Nadia Polikarpova. 2023. Grounded Copilot: How Programmers Interact with Code-Generating Models. Proceedings of the ACM on Programming Languages 7, OOPSLA1 (2023), 85–111

  3. [3]

Joel Becker, Nate Rush, Elizabeth Barnes, and David Rein. 2025. Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. arXiv preprint arXiv:2507.09089 (2025)

  4. [4]

Frank Buschmann, Regine Meunier, Hans Rohnert, Peter Sommerlad, and Michael Stal. 1996. Pattern-Oriented Software Architecture - Volume 1: A System of Patterns. Wiley Publishing

  5. [5]

John L Campbell, Charles Quincy, Jordan Osserman, and Ove K Pedersen. 2013. Coding in-depth semistructured interviews: Problems of unitization and intercoder reliability and agreement. Sociological Methods & Research 42, 3 (2013), 294–320

  6. [6]

Kathy Charmaz. 2014. Constructing Grounded Theory. Sage Publishing

  7. [7]

Daniele De Bari, Giacomo Garaccione, Riccardo Coppola, Marco Torchiano, and Luca Ardito. 2024. Evaluating Large Language Models in Exercises of UML Class Diagram Modeling. In Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). ACM, 393–399

  8. [8]

Rudra Dhar, Karthik Vaidhyanathan, and Vasudeva Varma. 2024. Can LLMs Generate Architectural Design Decisions? - An Exploratory Empirical Study. In Proceedings of the 21st IEEE International Conference on Software Architecture (ICSA). IEEE, 79–89

  9. [9]

J. Andrés Díaz-Pace, Antonela Tommasel, and Rafael Capilla. 2024. Helping Novice Architects to Make Quality Design Decisions Using LLM-Based Assistants. In Proceedings of the 17th European Conference on Software Architecture (ECSA). Springer, 324–332

  10. [10]

Matteo Esposito, Xiaozhou Li, Sergio Moreschini, Noman Ahmad, Tomas Cerny, Karthik Vaidhyanathan, Valentina Lenarduzzi, and Davide Taibi. 2026. Generative AI for Software Architecture: Applications, Challenges, and Future Directions. Journal of Systems and Software 231 (2026), 112607

  11. [11]

Davide Falessi, Lionel C Briand, Giovanni Cantone, Rafael Capilla, and Philippe Kruchten. 2013. The value of design rationale information. ACM Transactions on Software Engineering and Methodology 22, 3 (2013), 1–32

  12. [12]

Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M Zhang. 2023. Large language models for software engineering: Survey and open problems. In Proceedings of the IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE). IEEE, 31–53

  13. [13]

Arlene Fink. 2003. The Survey Handbook. Sage Publishing

  14. [14]

Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. 1995. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley

  15. [15]

David Garlan and Mary Shaw. 1994. An Introduction to Software Architecture. Technical Report. Carnegie Mellon University, Software Engineering Institute. https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=5141

  16. [16]

Polydoros Giannouris and Sophia Ananiadou. 2025. NOMAD: A Multi-Agent LLM System for UML Class Diagram Generation from Natural Language Requirements. arXiv preprint arXiv:2511.22409 (2025)

  17. [17]

    GitHub. 2022. Research: Quantifying GitHub Copilot’s impact on developer productivity and happiness. GitHub Blog

  18. [18]

GitHub REST API documentation. 2022. Retrieved April 12, 2025 from https://docs.github.com/en/rest?apiVersion=2022-11-28

  19. [19]

Binnur Görer and Fatma Başak Aydemir. 2023. Generating Requirements Elicitation Interview Scripts with Large Language Models. In Proceedings of the 31st IEEE International Requirements Engineering Conference Workshops (REW). IEEE, 44–51

  20. [20]

Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large language models for software engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology 33, 8 (2024), 1–79

  21. [21]

    Jasmin Jahic and Ashkan Sami. 2024. State of Practice: LLMs in Software Engineering and Software Architecture. In Proceedings of the 21st IEEE International Conference on Software Architecture Companion (ICSA-C). IEEE, 311–318

  22. [22]

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?. In Proceedings of the International Conference on Learning Representations (ICLR)

  23. [23]

Parampreet Kaur, Jill Stoltzfus, and Vikas Yellapu. 2018. Descriptive statistics. International Journal of Academic Medicine 4, 1 (2018), 60–63

  24. [24]

Barbara A. Kitchenham and Shari L. Pfleeger. 2008. Personal Opinion Surveys. Springer, Chapter 3, 63–92

  25. [25]

Ruiyin Li, Peng Liang, Mohamed Soliman, and Paris Avgeriou. 2022. Understanding software architecture erosion: A systematic mapping study. Journal of Software: Evolution and Process 34, 3 (2022), e2423

  26. [26]

Ruiyin Li, Peng Liang, Yifei Wang, Yangxiao Cai, Weisong Sun, and Zengyang Li. 2026. Unveiling the Role of ChatGPT in Software Development: Insights from Developer-ChatGPT Interactions on GitHub. ACM Transactions on Software Engineering and Methodology (2026)

  27. [27]

Yishu Li, Jacky Keung, Xiaoxue Ma, Chun Yong Chong, Jingyu Zhang, and Yihan Liao. 2024. LLM-Based Class Diagram Derivation from User Stories with Chain-of-Thought Promptings. In Proceedings of the 48th IEEE Annual Computers, Software, and Applications Conference (COMPSAC). IEEE, 45–50

  28. [28]

Jenny T Liang, Chenyang Yang, and Brad A Myers. 2024. A large-scale survey on the usability of AI programming assistants: Successes and challenges. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE). ACM, 1–13

  29. [29]

Hussein Mozannar, Gagan Bansal, Adam Fourney, and Eric Horvitz. 2024. Reading between the lines: Modeling user behavior and costs in AI-assisted programming. In Proceedings of the 44th CHI Conference on Human Factors in Computing Systems (CHI). ACM, 1–16

  30. [30]

Mohit Pandey, Tanmay Chand, Jennifer Horkoff, Miroslaw Staron, Miroslaw Ochodek, and Milos Durisic. 2025. Design Pattern Recognition: A Study of Large Language Models. Empirical Software Engineering 30, 3, Article 69 (2025)

  31. [31]

Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. 2023. The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. Microsoft Research (2023)

  32. [32]

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2024. ChatDev: Communicative Agents for Software Development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL). ACL, 15174–15186

  33. [33]

Krishna Ronanki, Beatriz Cabrero-Daniel, Jennifer Horkoff, and Christian Berger. 2024. Requirements Engineering Using Generative AI: Prompts and Prompting Patterns. Springer, Chapter 5, 109–127

  34. [34]

Steven I. Ross, Fernando Martinez, Stephanie Houde, Michael Muller, and Justin D. Weisz. 2023. The Programmer’s Assistant: Conversational Interaction with a Large Language Model for Software Development. In Proceedings of the 28th International Conference on Intelligent User Interfaces (IUI). ACM, 491–514

  35. [35]

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, et al. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv preprint arXiv:2302.04761 (2023)

  36. [36]

Larissa Schmid, Tobias Hey, Martin Armbruster, Sophie Corallo, Dominik Fuchß, Jan Keim, Haoyu Liu, and Anne Koziolek. 2025. Software Architecture Meets LLMs: A Systematic Literature Review. arXiv preprint arXiv:2505.16697 (2025)

  37. [37]

    Antony Tang and Hans Van Vliet. 2009. Modeling constraints improves software architecture design reasoning. In Proceedings of the Joint Working IEEE/IFIP Conference on Software Architecture & European Conference on Software Architecture (WICSA/ECSA). IEEE, 253–256

  38. [38]

Dan Tofan, Matthias Galster, and Paris Avgeriou. 2013. Difficulty of Architectural Decisions – A Survey with Professional Architects. In Proceedings of the 7th European Conference on Software Architecture (ECSA). Springer, 192–199

  39. [39]

Yifei Wang, Ruiyin Li, Peng Liang, Yangxiao Cai, Zengyang Li, and Mojtaba Shahin. 2026. Replication Package of the Paper: Using LLMs in Software Design: An Empirical Study of GitHub and A Practitioner Survey. https://github.com/WhitenWhiten/LLM4Design

  40. [40]

Jules White, Sam Hays, Quchen Fu, Jesse Spencer-Smith, and Douglas C. Schmidt. 2023. ChatGPT Prompt Patterns for Improving Code Quality, Refactoring, Requirements Elicitation, and Software Design. arXiv preprint arXiv:2303.07839 (2023)

  41. [41]

Claes Wohlin, Per Runeson, Martin Höst, Magnus C Ohlsson, Björn Regnell, and Anders Wesslén. 2012. Experimentation in Software Engineering. Springer Science & Business Media

  42. [42]

Chen Yang, Peng Liang, and Paris Avgeriou. 2018. Assumptions and their management in software development: A systematic mapping study. Information and Software Technology 94 (2018), 82–110

  43. [43]

Shunyu Yao, Jeffrey Zhao, Dian Yu, et al. 2022. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv preprint arXiv:2210.03629 (2022)

  44. [44]

Jiarui Zhang et al. 2024. Agentless: Demystifying LLM-based Software Engineering Agents. arXiv preprint arXiv:2407.01489 (2024)

  45. [45]

Xiyu Zhou, Ruiyin Li, Peng Liang, Beiqi Zhang, Mojtaba Shahin, Zengyang Li, and Chen Yang. 2025. Using LLMs in generating design rationale for software architecture decisions. ACM Transactions on Software Engineering and Methodology (2025)