Can ChatGPT Generate Realistic Synthetic System Requirement Specifications? Results of a Case Study

Alex R. Mattukat; Florian M. Braun; Horst Lichter

arxiv: 2603.09335 · v2 · submitted 2026-03-10 · 💻 cs.SE

Pith reviewed 2026-05-15 13:39 UTC · model grok-4.3

keywords: ChatGPT · synthetic data · system requirements specifications · requirements engineering · large language models · hallucinations · expert evaluation · prompt engineering

The pith

ChatGPT can generate synthetic system requirement specifications rated realistic by 62% of experts, though contradictions and other flaws remain.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores whether ChatGPT can create realistic synthetic system requirement specifications without access to real examples. Researchers generated 300 such documents across 10 industries using structured prompts and refinements. An expert survey found 62% rated them realistic, yet detailed review exposed issues like contradictory statements. This approach could ease data scarcity for research in requirements engineering. The study concludes that expert evaluation is still required, as LLM-based checks fall short.

Core claim

Using prompt patterns, LLM-based quality assessments, and iterative refinements, the authors generated 300 synthetic system requirement specifications (SSyRSs) across 10 industries with ChatGPT. Cross-model checks and a survey of 87 experts showed 62% considered the SSyRSs realistic, but in-depth analysis revealed contradictory statements and other deficiencies. The central finding is that realistic SSyRSs can be produced to a certain extent, yet LLM quality assessments cannot replace thorough expert evaluations.

What carries the argument

Iterative prompt patterns with LLM self-assessments to generate and refine SSyRSs without real data access.
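
The generate, assess, refine cycle described above can be sketched as a small loop. This is a minimal illustration in Python, not the authors' pipeline: the paper's actual prompts, scoring scale, and stopping rule are not given here, so `refine_until_acceptable`, `Assessment`, the 0.8 threshold, and the two stand-in functions `fake_generate` and `fake_assess` are all hypothetical. In a real setup the stand-ins would be replaced by calls to an LLM.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Assessment:
    score: float   # hypothetical scale: 0.0 (poor) to 1.0 (good)
    feedback: str  # free-text critique fed into the next prompt


def refine_until_acceptable(
    prompt: str,
    generate: Callable[[str], str],
    assess: Callable[[str], Assessment],
    threshold: float = 0.8,  # assumed cut-off, not taken from the paper
    max_rounds: int = 3,
):
    """Generate a draft, score it, and fold the critique back into the prompt."""
    history = []
    draft = ""
    for _ in range(max_rounds):
        draft = generate(prompt)
        result = assess(draft)
        history.append((prompt, draft, result))
        if result.score >= threshold:
            break
        prompt = f"{prompt}\nAddress this feedback: {result.feedback}"
    return draft, history


# Deterministic stand-ins so the sketch runs without any LLM access.
def fake_generate(prompt: str) -> str:
    rounds = prompt.count("Address this feedback")
    return f"SSyRS draft after {rounds} refinement(s)"


def fake_assess(draft: str) -> Assessment:
    rounds = int(draft.split()[3])
    return Assessment(score=0.5 + 0.2 * rounds,
                      feedback="resolve contradictory requirements")


final_draft, history = refine_until_acceptable(
    "Write an SSyRS for a logistics platform.", fake_generate, fake_assess
)
print(final_draft, len(history))
```

The loop stops on whatever the assessor reports, which is exactly the weak point the paper identifies: if the LLM-based assessor misses a contradiction, the loop accepts the draft anyway.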

If this is right

Synthetic SSyRSs can supplement scarce real data for requirements engineering research.
LLM generation works for initial drafts but requires human oversight to catch contradictions.
Expert surveys provide a necessary check that LLM self-evaluations miss.
The method scales across industries but quality varies by domain.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Generated specs might still serve as training data for NLP tools even with flaws, if post-processed.
Similar prompting could apply to other confidential natural-language artifacts like test cases or user stories.
Future studies could test whether fixing detected contradictions improves downstream usability metrics.

Load-bearing premise

Survey ratings of 'realistic' by experts capture the actual utility and correctness of the generated specifications for downstream engineering tasks.

What would settle it

A controlled trial where teams build software from the generated SSyRSs versus real ones, then measure differences in completeness, error rates, or project outcomes.

Figures

Figures reproduced from arXiv: 2603.09335 by Alex R. Mattukat, Florian M. Braun, Horst Lichter.

**Figure 2.** Excerpt of the logistics SSyRS “Dynamic Freight Optimization Platform (DFOP)”. The whole SSyRS can be …

**Figure 3.** Distribution of the overall rating of the degree of realism of the SSyRSs.

Original abstract

System requirement specifications (SyRSs) are central, natural-language (NL) artifacts. Access to real SyRS for research purposes is highly valuable but limited by proprietary restrictions or confidentiality concerns. Generating synthetic SyRSs (SSyRSs) can address this scarcity. Black-box large language models (LLMs) such as ChatGPT offer compelling generation capabilities by providing easy access to NL generation functions without requiring access to real data. However, LLMs suffer from hallucinations and overconfidence, which pose major challenges in their use. We designed an exploratory study to investigate whether, despite these challenges, we can generate realistic SSyRSs with ChatGPT without having access to real SyRSs. Using a systematic approach that leverages prompt patterns, LLM-based quality assessments, and iterative prompt refinements, we generated 300 SSyRSs across 10 industries with ChatGPT. The results were evaluated using cross-model checks and an expert study, with n=87 submitted surveys. 62% of experts considered the SSyRSs to be realistic. However, in-depth examination revealed contradictory statements and deficiencies. Overall, we were able to generate realistic SSyRSs to a certain extent with ChatGPT, but LLM-based quality assessments cannot fully replace thorough expert evaluations. This paper presents the methodology and results of our study and discusses the key insights we obtained.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a concrete case study showing ChatGPT can produce synthetic system requirements that 62% of 87 experts call realistic, but the ratings rest on uncalibrated subjective judgments without real controls or downstream checks.

The main takeaway is that this work delivers a practical data point on using LLMs for synthetic SyRS generation across ten industries. The authors produced 300 documents with prompt patterns and iterative refinement, ran cross-model checks, and collected expert feedback showing partial realism. That specific quantitative result on this artifact type fills a small gap in the requirements engineering literature, where real specs are hard to share.

The authors are straightforward about the limits too, noting contradictory statements in the outputs and that LLM self-assessments fall short of expert review. That honesty helps the contribution land as exploratory rather than overstated. The setup is simple and the scale is reasonable for a first look at the problem.

The soft spots sit in the evaluation. The survey asked only for realism ratings on the generated items, with no real SyRS controls mixed in for calibration, no inter-rater reliability numbers, and no follow-up tasks like deriving test cases or spotting inconsistencies to see if the rated documents actually work. The 62% figure therefore stays at the level of subjective impression rather than demonstrated utility. The paper itself flags deficiencies, so the central claim stays qualified.

This is the kind of study that fits a requirements engineering reading group or a workshop on LLM applications in SE. Readers working on data scarcity for empirical RE work will find the method and numbers useful as a starting point. It deserves a serious referee because the empirical core is clear enough to review and improve, even if the validation needs tightening. I would send it out for peer review with the expectation that the authors add controls and a small downstream task to strengthen the evidence.

Referee Report

3 major / 3 minor

Summary. The paper reports an exploratory case study in which ChatGPT was used to generate 300 synthetic system requirement specifications (SSyRSs) across 10 industries via systematic prompt patterns and iterative refinement. Evaluation combined cross-model checks with an expert survey (n=87), yielding the result that 62% of respondents rated the SSyRSs as realistic. The authors note deficiencies such as contradictory statements within the generated documents and conclude that ChatGPT can produce realistic SSyRSs to a limited extent, while LLM-based quality assessments cannot replace thorough expert evaluations.

Significance. If the evaluation methodology is strengthened, the work would provide a concrete prompting-based approach to generating synthetic requirements artifacts, addressing data scarcity in software engineering research caused by confidentiality constraints. It contributes empirical evidence on LLM capabilities and limitations in requirements engineering and underscores the continued necessity of human oversight, which could inform future tool-building and benchmarking efforts in the field.

major comments (3)

[Results / Expert Study] The expert survey (described in the Results section) reports a 62% realism rating but provides no inter-rater reliability statistics and includes no control samples of authentic SyRSs. This leaves the 62% figure uncalibrated and makes it difficult to interpret as evidence that the generated artifacts are realistic in an engineering sense.
[Evaluation and Discussion] No downstream task validation is performed. The generated SSyRSs are not tested for usability in activities such as test-case derivation, inconsistency detection, or traceability analysis, so the link between subjective realism ratings and practical utility remains unexamined.
[Abstract and Conclusion] The abstract and conclusion treat the 62% expert rating as support for partial success even while acknowledging contradictory statements and deficiencies in the outputs. A clearer reconciliation of these observations with the headline claim is required.
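
To make the first major comment concrete, it helps to see how much sampling uncertainty the 62% figure carries even before calibration against real SyRSs. The sketch below computes a Wilson score interval in Python. It rests on two assumptions that the material here does not confirm: that 62% corresponds to about 54 of the 87 submitted surveys, and that each survey counts as one independent yes/no judgment. The helper name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Assumption: 62% of n = 87 surveys is roughly 54 "realistic" ratings.
low, high = wilson_interval(54, 87)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

Under those assumptions the interval is about twenty percentage points wide, running from just above 50% to just above 70%. That is a statement about sampling noise only; it does not address the missing controls or whether "realistic" tracks engineering utility.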

minor comments (3)

[Methodology] Provide the exact prompt templates and refinement steps in an appendix to support reproducibility.
[Evaluation] Clarify the cross-model checks: specify which models were compared and the precise criteria used for consistency assessment.
[Results] Add a table or figure showing realism ratings broken down by industry to allow readers to assess variation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and limitations of our exploratory study. We address each major point below and indicate planned revisions.

Referee: [Results / Expert Study] The expert survey (described in the Results section) reports a 62% realism rating but provides no inter-rater reliability statistics and includes no control samples of authentic SyRSs. This leaves the 62% figure uncalibrated and makes it difficult to interpret as evidence that the generated artifacts are realistic in an engineering sense.

Authors: We agree that the lack of inter-rater reliability statistics and authentic control samples limits calibration of the 62% figure. The survey was intentionally designed as an initial perception study focused solely on the generated SSyRSs. We will add a limitations subsection in the revised manuscript that explicitly discusses this design choice and recommends controls and reliability measures for follow-up studies.

Revision: partial

Referee: [Evaluation and Discussion] No downstream task validation is performed. The generated SSyRSs are not tested for usability in activities such as test-case derivation, inconsistency detection, or traceability analysis, so the link between subjective realism ratings and practical utility remains unexamined.

Authors: We concur that downstream validation would strengthen claims about practical utility. Given the exploratory focus on generation and initial realism assessment, such tasks were outside the current scope. We will revise the Discussion section to state this limitation clearly and identify downstream validation as a key item for future work.

Revision: partial

Referee: [Abstract and Conclusion] The abstract and conclusion treat the 62% expert rating as support for partial success even while acknowledging contradictory statements and deficiencies in the outputs. A clearer reconciliation of these observations with the headline claim is required.

Authors: We will revise both the abstract and conclusion to more explicitly reconcile the 62% rating with the identified contradictions and deficiencies. The updated text will stress that the outputs are realistic only to a limited extent and that LLM-based assessments cannot substitute for expert evaluation.

Revision: yes
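
The reliability measures promised in the first response could take several forms. One common choice when several raters judge the same documents is Fleiss' kappa, sketched below in Python. This is an illustration, not something the paper or rebuttal specifies: it assumes every document receives the same number of ratings, which may not match the actual survey design, and the `toy_counts` table is made-up data used only to exercise the function.

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa for a documents x categories table of rating counts.

    counts[i][j] is the number of raters who put document i into category j.
    Every document must have the same number of ratings (at least 2).
    """
    if not counts:
        raise ValueError("need at least one rated document")
    n_docs = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each document needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Observed agreement: mean share of agreeing rater pairs per document.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_docs

    # Chance agreement, from the overall proportion of each category.
    total = n_docs * n_raters
    p_e = sum(
        (sum(row[j] for row in counts) / total) ** 2 for j in range(n_categories)
    )
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (p_bar - p_e) / (1 - p_e)


# Made-up illustration: 4 documents, 3 raters each,
# columns = (realistic, not realistic). Not data from the study.
toy_counts = [[3, 0], [2, 1], [1, 2], [3, 0]]
print(round(fleiss_kappa(toy_counts), 3))
```

A value near 1 indicates agreement well beyond chance, and a value near 0 indicates agreement no better than chance. Reporting such a figure alongside the 62% would show whether experts agreed on which documents looked realistic, not just how many did overall.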

Circularity Check

0 steps flagged

No circularity: empirical results rest on external expert judgments

The paper is an empirical case study that generates 300 SSyRSs via ChatGPT prompts and evaluates them through cross-model checks plus an external survey of n=87 experts yielding a 62% realism rate. No equations, fitted parameters, or derivations appear; the central quantitative claim is produced by independent human raters rather than any self-referential construction, self-citation load-bearing premise, or renaming of known results. The methodology (prompt patterns, iterative refinement) is described transparently and does not reduce to its own outputs by definition. This is the normal non-circular outcome for a survey-based generation study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that expert survey responses constitute a sufficient proxy for realism and that prompt-based generation can be evaluated without reference to real proprietary documents.

axioms (1)

domain assumption: Expert judgment on a survey is a valid and sufficient measure of whether a synthetic specification is realistic.
Invoked when interpreting the 62% figure as evidence of successful generation.

