
arxiv: 2603.09335 · v2 · submitted 2026-03-10 · 💻 cs.SE

Can ChatGPT Generate Realistic Synthetic System Requirement Specifications? Results of a Case Study

Pith reviewed 2026-05-15 13:39 UTC · model grok-4.3

classification 💻 cs.SE
keywords ChatGPT · synthetic data · system requirements specifications · requirements engineering · large language models · hallucinations · expert evaluation · prompt engineering

The pith

ChatGPT can generate synthetic system requirement specifications rated realistic by 62% of experts, though contradictions and other flaws remain.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores whether ChatGPT can create realistic synthetic system requirement specifications without access to real examples. Researchers generated 300 such documents across 10 industries using structured prompts and refinements. An expert survey found 62% rated them realistic, yet detailed review exposed issues like contradictory statements. This approach could ease data scarcity for research in requirements engineering. The study concludes that expert evaluation is still required, as LLM-based checks fall short.

Core claim

Using prompt patterns, LLM-based quality assessments, and iterative refinements, the authors generated 300 synthetic system requirement specifications (SSyRSs) across 10 industries with ChatGPT. Cross-model checks and a survey of 87 experts showed 62% considered the SSyRSs realistic, but in-depth analysis revealed contradictory statements and other deficiencies. The central finding is that realistic SSyRSs can be produced to a certain extent, yet LLM quality assessments cannot replace thorough expert evaluations.

What carries the argument

Iterative prompt patterns with LLM self-assessments to generate and refine SSyRSs without real data access.
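
The generate, assess, refine loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the names `generate_with_refinement`, `stub_generate`, and `stub_assess`, the 0.8 threshold, and the three-round cap are all assumptions made for the example, and the stubs stand in for real LLM calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Draft:
    industry: str
    text: str
    score: float  # assessor's quality score in [0, 1] (hypothetical scale)
    rounds: int   # number of generation rounds used


def generate_with_refinement(
    industry: str,
    generate: Callable[[str], str],
    assess: Callable[[str], tuple[float, str]],
    base_prompt: str,
    threshold: float = 0.8,  # assumed acceptance threshold
    max_rounds: int = 3,     # assumed cap on refinement rounds
) -> Draft:
    """Generate a draft, score it, and re-prompt with feedback until it passes."""
    prompt = base_prompt.format(industry=industry)
    text = generate(prompt)
    score, feedback = assess(text)
    rounds = 1
    while score < threshold and rounds < max_rounds:
        # Fold the assessor's feedback into the next prompt.
        prompt = f"{prompt}\n\nRevise the previous draft. Reviewer feedback: {feedback}"
        text = generate(prompt)
        score, feedback = assess(text)
        rounds += 1
    return Draft(industry, text, score, rounds)


# Deterministic stand-ins for the generator and assessor models.
def stub_generate(prompt: str) -> str:
    revisions = prompt.count("Revise the previous draft")
    return f"SSyRS draft v{revisions + 1}"


def stub_assess(text: str) -> tuple[float, str]:
    version = int(text.rsplit("v", 1)[1])
    return min(1.0, 0.5 + 0.2 * version), "resolve contradictory requirements"


draft = generate_with_refinement(
    "logistics",
    stub_generate,
    stub_assess,
    "Write a system requirement specification for the {industry} industry.",
)
print(draft.rounds, draft.text)
```

Passing the generator and assessor in as callables keeps the loop independent of any particular model API. It also makes the paper's caveat visible: the loop stops when the assessor is satisfied, which says nothing about whether a human expert would be.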

If this is right

  • Synthetic SSyRSs can supplement scarce real data for requirements engineering research.
  • LLM generation works for initial drafts but requires human oversight to catch contradictions.
  • Expert surveys provide a necessary check that LLM self-evaluations miss.
  • The method scales across industries but quality varies by domain.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Generated specs might still serve as training data for NLP tools even with flaws, if post-processed.
  • Similar prompting could apply to other confidential natural-language artifacts like test cases or user stories.
  • Future studies could test whether fixing detected contradictions improves downstream usability metrics.

Load-bearing premise

Survey ratings of 'realistic' by experts capture the actual utility and correctness of the generated specifications for downstream engineering tasks.

What would settle it

A controlled trial where teams build software from the generated SSyRSs versus real ones, then measure differences in completeness, error rates, or project outcomes.

Figures

Figures reproduced from arXiv: 2603.09335 by Alex R. Mattukat, Florian M. Braun, Horst Lichter.

Figure 1: The SSyRS generation process (colors indicate loops, italic comments describe loop conditions).
Figure 2: Excerpt of the logistics SSyRS “Dynamic Freight Optimization Platform (DFOP)”. The whole SSyRS can be ...
Figure 3: Distribution of the overall rating of the degree of realism of the SSyRSs.
Original abstract

System requirement specifications (SyRSs) are central, natural-language (NL) artifacts. Access to real SyRS for research purposes is highly valuable but limited by proprietary restrictions or confidentiality concerns. Generating synthetic SyRSs (SSyRSs) can address this scarcity. Black-box large language models (LLMs) such as ChatGPT offer compelling generation capabilities by providing easy access to NL generation functions without requiring access to real data. However, LLMs suffer from hallucinations and overconfidence, which pose major challenges in their use. We designed an exploratory study to investigate whether, despite these challenges, we can generate realistic SSyRSs with ChatGPT without having access to real SyRSs. Using a systematic approach that leverages prompt patterns, LLM-based quality assessments, and iterative prompt refinements, we generated 300 SSyRSs across 10 industries with ChatGPT. The results were evaluated using cross-model checks and an expert study, with n=87 submitted surveys. 62% of experts considered the SSyRSs to be realistic. However, in-depth examination revealed contradictory statements and deficiencies. Overall, we were able to generate realistic SSyRSs to a certain extent with ChatGPT, but LLM-based quality assessments cannot fully replace thorough expert evaluations. This paper presents the methodology and results of our study and discusses the key insights we obtained.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper reports an exploratory case study in which ChatGPT was used to generate 300 synthetic system requirement specifications (SSyRSs) across 10 industries via systematic prompt patterns and iterative refinement. Evaluation combined cross-model checks with an expert survey (n=87), yielding the result that 62% of respondents rated the SSyRSs as realistic. The authors note deficiencies such as contradictory statements within the generated documents and conclude that ChatGPT can produce realistic SSyRSs to a limited extent, while LLM-based quality assessments cannot replace thorough expert evaluations.

Significance. If the evaluation methodology is strengthened, the work would provide a concrete prompting-based approach to generating synthetic requirements artifacts, addressing data scarcity in software engineering research caused by confidentiality constraints. It contributes empirical evidence on LLM capabilities and limitations in requirements engineering and underscores the continued necessity of human oversight, which could inform future tool-building and benchmarking efforts in the field.

major comments (3)
  1. [Results / Expert Study] The expert survey (described in the Results section) reports a 62% realism rating but provides no inter-rater reliability statistics and includes no control samples of authentic SyRSs. This leaves the 62% figure uncalibrated and makes it difficult to interpret as evidence that the generated artifacts are realistic in an engineering sense.
  2. [Evaluation and Discussion] No downstream task validation is performed. The generated SSyRSs are not tested for usability in activities such as test-case derivation, inconsistency detection, or traceability analysis, so the link between subjective realism ratings and practical utility remains unexamined.
  3. [Abstract and Conclusion] The abstract and conclusion treat the 62% expert rating as support for partial success even while acknowledging contradictory statements and deficiencies in the outputs. A clearer reconciliation of these observations with the headline claim is required.
minor comments (3)
  1. [Methodology] Provide the exact prompt templates and refinement steps in an appendix to support reproducibility.
  2. [Evaluation] Clarify the cross-model checks: specify which models were compared and the precise criteria used for consistency assessment.
  3. [Results] Add a table or figure showing realism ratings broken down by industry to allow readers to assess variation.
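
The inter-rater reliability statistic requested in major comment 1 could be computed with Fleiss' kappa, sketched below. This is only an illustration under assumptions: it presumes that each SSyRS was rated by the same number of experts on a categorical scale, which the summary above does not confirm, and the function name `fleiss_kappa` and the toy count matrices are hypothetical.

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa for categorical ratings.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must have been rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])

    # Share of all ratings that fell into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_observed - p_expected) / (1 - p_expected)


# Toy data (hypothetical): columns are "realistic" and "not realistic".
perfect = fleiss_kappa([[3, 0], [0, 3]])  # raters always agree
split = fleiss_kappa([[2, 1], [1, 2]])    # raters mostly disagree
print(perfect, split)
```

Values near 1 indicate agreement well beyond chance, while values near or below 0 indicate that the experts' realism ratings are no more consistent than random assignment would be.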

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and limitations of our exploratory study. We address each major point below and indicate planned revisions.

  1. Referee: [Results / Expert Study] The expert survey (described in the Results section) reports a 62% realism rating but provides no inter-rater reliability statistics and includes no control samples of authentic SyRSs. This leaves the 62% figure uncalibrated and makes it difficult to interpret as evidence that the generated artifacts are realistic in an engineering sense.

    Authors: We agree that the lack of inter-rater reliability statistics and authentic control samples limits calibration of the 62% figure. The survey was intentionally designed as an initial perception study focused solely on the generated SSyRSs. We will add a limitations subsection in the revised manuscript that explicitly discusses this design choice and recommends controls and reliability measures for follow-up studies.
    Revision: partial

  2. Referee: [Evaluation and Discussion] No downstream task validation is performed. The generated SSyRSs are not tested for usability in activities such as test-case derivation, inconsistency detection, or traceability analysis, so the link between subjective realism ratings and practical utility remains unexamined.

    Authors: We concur that downstream validation would strengthen claims about practical utility. Given the exploratory focus on generation and initial realism assessment, such tasks were outside the current scope. We will revise the Discussion section to state this limitation clearly and identify downstream validation as a key item for future work.
    Revision: partial

  3. Referee: [Abstract and Conclusion] The abstract and conclusion treat the 62% expert rating as support for partial success even while acknowledging contradictory statements and deficiencies in the outputs. A clearer reconciliation of these observations with the headline claim is required.

    Authors: We will revise both the abstract and conclusion to more explicitly reconcile the 62% rating with the identified contradictions and deficiencies. The updated text will stress that the outputs are realistic only to a limited extent and that LLM-based assessments cannot substitute for expert evaluation.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results rest on external expert judgments

The paper is an empirical case study that generates 300 SSyRSs via ChatGPT prompts and evaluates them through cross-model checks plus an external survey of n=87 experts yielding a 62% realism rate. No equations, fitted parameters, or derivations appear; the central quantitative claim is produced by independent human raters rather than any self-referential construction, self-citation load-bearing premise, or renaming of known results. The methodology (prompt patterns, iterative refinement) is described transparently and does not reduce to its own outputs by definition. This is the normal non-circular outcome for a survey-based generation study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that expert survey responses constitute a sufficient proxy for realism and that prompt-based generation can be evaluated without reference to real proprietary documents.

axioms (1)
  • domain assumption: Expert judgment on a survey is a valid and sufficient measure of whether a synthetic specification is realistic.
    Invoked when interpreting the 62% figure as evidence of successful generation.
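
Even if this axiom is granted, the 62% figure carries sampling uncertainty at n=87. The sketch below puts a 95% Wilson score interval around it. It rests on two assumptions: the summary reports only the percentage, so the count of 54 "realistic" responses is inferred by rounding 0.62 × 87, and each submitted survey is treated as one independent yes/no judgment, which may not match the actual survey design.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


n = 87                       # submitted surveys, from the abstract
successes = round(0.62 * n)  # assumed count of "realistic" ratings (54)
low, high = wilson_interval(successes, n)
print(f"{successes}/{n}: 95% interval {low:.3f} to {high:.3f}")
```

Under these assumptions the interval runs from roughly 52% to 72%, so the data are compatible with anything from a bare majority to a clear one.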

pith-pipeline@v0.9.0 · 5546 in / 1171 out tokens · 46115 ms · 2026-05-15T13:39:57.737459+00:00 · methodology

