Can ChatGPT Generate Realistic Synthetic System Requirement Specifications? Results of a Case Study
Pith reviewed 2026-05-15 13:39 UTC · model grok-4.3
The pith
ChatGPT can generate synthetic system requirement specifications rated realistic by 62% of experts, though contradictions and other flaws remain.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using prompt patterns, LLM-based quality assessments, and iterative refinements, the authors generated 300 synthetic system requirement specifications (SSyRSs) across 10 industries with ChatGPT. Cross-model checks and a survey of 87 experts showed 62% considered the SSyRSs realistic, but in-depth analysis revealed contradictory statements and other deficiencies. The central finding is that realistic SSyRSs can be produced to a certain extent, yet LLM quality assessments cannot replace thorough expert evaluations.
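To give a sense of how much sampling uncertainty sits behind the 62% figure, the following minimal sketch computes a Wilson score interval for a proportion. It assumes the 62% refers to all 87 submitted surveys, which corresponds to about 54 respondents; that count is an assumption made for illustration and is not reported in the abstract. The function name `wilson_interval` is likewise hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumption: 62% of n=87 is taken to mean 54 respondents (54/87 is about 0.62).
low, high = wilson_interval(54, 87)
print(f"approximate 95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval runs from roughly 52% to 72%, so the headline number is compatible with anything from a slim majority to a clear one.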
What carries the argument
Iterative prompt patterns with LLM self-assessments to generate and refine SSyRSs without real data access.
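The generate, assess, and refine cycle described here can be sketched as a small loop. This is not the authors' pipeline: the paper's prompts, scoring scale, and stopping rule are not given in this summary, so `refine_loop`, `Assessment`, the 0.8 threshold, and the toy generator and assessor below are all hypothetical stand-ins. In practice `generate` and `assess` would wrap calls to an LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Assessment:
    score: float   # assumed scale: 0.0 (poor) to 1.0 (good)
    feedback: str  # free-text critique fed back into the next prompt


def refine_loop(
    prompt: str,
    generate: Callable[[str], str],
    assess: Callable[[str], Assessment],
    threshold: float = 0.8,
    max_rounds: int = 5,
) -> Tuple[str, List[float]]:
    """Generate a draft, score it, and fold the critique into the prompt
    until the score reaches the threshold or the round budget runs out."""
    history: List[float] = []
    draft = ""
    for _ in range(max_rounds):
        draft = generate(prompt)
        result = assess(draft)
        history.append(result.score)
        if result.score >= threshold:
            break
        prompt = f"{prompt}\nAddress this feedback: {result.feedback}"
    return draft, history


# Toy stand-ins (hypothetical): the generator adds one requirement per
# feedback line in the prompt; the assessor rewards having three or more.
def toy_generate(prompt: str) -> str:
    count = 1 + prompt.count("Address this feedback")
    return "\n".join(
        f"REQ-{i}: The system shall support feature {i}." for i in range(1, count + 1)
    )


def toy_assess(draft: str) -> Assessment:
    count = len(draft.splitlines())
    return Assessment(score=min(1.0, count / 3), feedback="Add more requirements.")


final_draft, scores = refine_loop("Write an SSyRS for a parking app.", toy_generate, toy_assess)
print(scores)
```

Keeping the assessor behind a plain callable makes it easy to swap the LLM self-assessment for a human rating, which matters given the finding that LLM assessments alone miss contradictions.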
If this is right
- Synthetic SSyRSs can supplement scarce real data for requirements engineering research.
- LLM generation works for initial drafts but requires human oversight to catch contradictions.
- Expert surveys provide a necessary check that LLM self-evaluations miss.
- The method can be applied across industries, though quality may vary by domain.
Where Pith is reading between the lines
- Generated specs might still serve as training data for NLP tools even with flaws, if post-processed.
- Similar prompting could apply to other confidential natural-language artifacts like test cases or user stories.
- Future studies could test whether fixing detected contradictions improves downstream usability metrics.
Load-bearing premise
Survey ratings of 'realistic' by experts capture the actual utility and correctness of the generated specifications for downstream engineering tasks.
What would settle it
A controlled trial where teams build software from the generated SSyRSs versus real ones, then measure differences in completeness, error rates, or project outcomes.
Original abstract
System requirement specifications (SyRSs) are central, natural-language (NL) artifacts. Access to real SyRS for research purposes is highly valuable but limited by proprietary restrictions or confidentiality concerns. Generating synthetic SyRSs (SSyRSs) can address this scarcity. Black-box large language models (LLMs) such as ChatGPT offer compelling generation capabilities by providing easy access to NL generation functions without requiring access to real data. However, LLMs suffer from hallucinations and overconfidence, which pose major challenges in their use. We designed an exploratory study to investigate whether, despite these challenges, we can generate realistic SSyRSs with ChatGPT without having access to real SyRSs. Using a systematic approach that leverages prompt patterns, LLM-based quality assessments, and iterative prompt refinements, we generated 300 SSyRSs across 10 industries with ChatGPT. The results were evaluated using cross-model checks and an expert study, with n=87 submitted surveys. 62% of experts considered the SSyRSs to be realistic. However, in-depth examination revealed contradictory statements and deficiencies. Overall, we were able to generate realistic SSyRSs to a certain extent with ChatGPT, but LLM-based quality assessments cannot fully replace thorough expert evaluations. This paper presents the methodology and results of our study and discusses the key insights we obtained.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports an exploratory case study in which ChatGPT was used to generate 300 synthetic system requirement specifications (SSyRSs) across 10 industries via systematic prompt patterns and iterative refinement. Evaluation combined cross-model checks with an expert survey (n=87), yielding the result that 62% of respondents rated the SSyRSs as realistic. The authors note deficiencies such as contradictory statements within the generated documents and conclude that ChatGPT can produce realistic SSyRSs to a limited extent, while LLM-based quality assessments cannot replace thorough expert evaluations.
Significance. If the evaluation methodology is strengthened, the work would provide a concrete prompting-based approach to generating synthetic requirements artifacts, addressing data scarcity in software engineering research caused by confidentiality constraints. It contributes empirical evidence on LLM capabilities and limitations in requirements engineering and underscores the continued necessity of human oversight, which could inform future tool-building and benchmarking efforts in the field.
major comments (3)
- [Results / Expert Study] The expert survey (described in the Results section) reports a 62% realism rating but provides no inter-rater reliability statistics and includes no control samples of authentic SyRSs. This leaves the 62% figure uncalibrated and makes it difficult to interpret as evidence that the generated artifacts are realistic in an engineering sense.
- [Evaluation and Discussion] No downstream task validation is performed. The generated SSyRSs are not tested for usability in activities such as test-case derivation, inconsistency detection, or traceability analysis, so the link between subjective realism ratings and practical utility remains unexamined.
- [Abstract and Conclusion] The abstract and conclusion treat the 62% expert rating as support for partial success even while acknowledging contradictory statements and deficiencies in the outputs. A clearer reconciliation of these observations with the headline claim is required.
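The inter-rater reliability statistic requested in the first major comment could take several forms. As one illustration, the sketch below computes Fleiss' kappa from a table of per-item category counts. The function name `fleiss_kappa` and the tiny input tables are hypothetical; the paper's survey data are not available here, and nothing below reflects its actual ratings.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who put item i into category j. Every item needs the same rater count."""
    if not counts:
        raise ValueError("need at least one item")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical data: two SSyRSs, three raters, categories (realistic, not realistic).
print(fleiss_kappa([[3, 0], [0, 3]]))  # raters agree fully on both items
print(fleiss_kappa([[2, 1], [1, 2]]))  # raters split on both items
```

Fleiss' kappa requires every item to be rated by the same number of raters. If each expert in a survey rates only a subset of documents, a measure that tolerates missing ratings, such as Krippendorff's alpha, may be a better fit.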
minor comments (3)
- [Methodology] Provide the exact prompt templates and refinement steps in an appendix to support reproducibility.
- [Evaluation] Clarify the cross-model checks: specify which models were compared and the precise criteria used for consistency assessment.
- [Results] Add a table or figure showing realism ratings broken down by industry to allow readers to assess variation.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the scope and limitations of our exploratory study. We address each major point below and indicate planned revisions.
- Referee: [Results / Expert Study] The expert survey (described in the Results section) reports a 62% realism rating but provides no inter-rater reliability statistics and includes no control samples of authentic SyRSs. This leaves the 62% figure uncalibrated and makes it difficult to interpret as evidence that the generated artifacts are realistic in an engineering sense.
  Authors: We agree that the lack of inter-rater reliability statistics and authentic control samples limits calibration of the 62% figure. The survey was intentionally designed as an initial perception study focused solely on the generated SSyRSs. We will add a limitations subsection in the revised manuscript that explicitly discusses this design choice and recommends controls and reliability measures for follow-up studies.
  Revision: partial
- Referee: [Evaluation and Discussion] No downstream task validation is performed. The generated SSyRSs are not tested for usability in activities such as test-case derivation, inconsistency detection, or traceability analysis, so the link between subjective realism ratings and practical utility remains unexamined.
  Authors: We concur that downstream validation would strengthen claims about practical utility. Given the exploratory focus on generation and initial realism assessment, such tasks were outside the current scope. We will revise the Discussion section to state this limitation clearly and identify downstream validation as a key item for future work.
  Revision: partial
- Referee: [Abstract and Conclusion] The abstract and conclusion treat the 62% expert rating as support for partial success even while acknowledging contradictory statements and deficiencies in the outputs. A clearer reconciliation of these observations with the headline claim is required.
  Authors: We will revise both the abstract and conclusion to more explicitly reconcile the 62% rating with the identified contradictions and deficiencies. The updated text will stress that the outputs are realistic only to a limited extent and that LLM-based assessments cannot substitute for expert evaluation.
  Revision: yes
Circularity Check
No circularity: empirical results rest on external expert judgments
The paper is an empirical case study that generates 300 SSyRSs via ChatGPT prompts and evaluates them through cross-model checks plus an external survey of n=87 experts yielding a 62% realism rate. No equations, fitted parameters, or derivations appear; the central quantitative claim is produced by independent human raters rather than any self-referential construction, self-citation load-bearing premise, or renaming of known results. The methodology (prompt patterns, iterative refinement) is described transparently and does not reduce to its own outputs by definition. This is the normal non-circular outcome for a survey-based generation study.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Expert judgment on a survey is a valid and sufficient measure of whether a synthetic specification is realistic.