On the Viability of Requirements Generation From Code: An Experience Report

Alexander Korn; Andreas Vogelsang; Jone Bartel; Max Unterbusch

arxiv: 2606.25550 · v1 · pith:SFECBIH6new · submitted 2026-06-24 · 💻 cs.SE

Pith reviewed 2026-06-25 20:05 UTC · model grok-4.3

classification 💻 cs.SE

keywords requirements engineering · large language models · dataset generation · code-to-requirements · requirements smells · human-in-the-loop · experience report · RAG

The pith

LLMs cannot reliably generate requirements from code or introduce and detect smells in them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Researchers tried to solve the shortage of datasets that link source code to requirements by using an LLM with retrieval-augmented generation to create requirements, verify which ones are implemented, and add synthetic problems such as smells. The tests revealed that the models could not produce non-implemented requirements on demand, could not produce high-quality requirements, and could not reliably insert smells. A single human reviewer also failed to catch the smells consistently. The authors therefore conclude that automated creation of such datasets with current LLMs is not viable and still requires substantial human supervision for quality control.

Core claim

The authors tested an LLM-based and RAG-supported agentic approach intended to generate requirements from source code, verify their implementation status with human assistance, and synthetically introduce requirements smells to mimic real datasets. Contrary to initial expectations, the LLM proved unable to generate non-implemented requirements reliably, unable to generate high-quality requirements, and unable to introduce synthetic requirements smells in a controlled way. Neither the LLM nor a single human-in-the-loop reviewer could detect requirements smells reliably. The authors therefore state that the generation of code-to-requirements datasets using LLMs is not yet viable and requires human supervision, especially for quality assurance.

What carries the argument

LLM-based and RAG-supported agentic system that generates requirements from code, checks implementation status via human-in-the-loop, and injects synthetic requirements smells.
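
The three stages named above can be pictured as a small pipeline. The sketch below is illustrative only and is not the authors' implementation: the `Requirement` record, `run_pipeline`, and the three callables are hypothetical names, the retrieval step is assumed to be folded into the `generate` callable, and the stub functions stand in for the LLM and the human reviewer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Requirement:
    text: str
    source_file: str
    implemented: Optional[bool] = None    # filled in by the human reviewer
    injected_smell: Optional[str] = None  # name of the smell, if one was injected


def run_pipeline(
    code_chunks: Dict[str, str],
    generate: Callable[[str], List[str]],
    review: Callable[[Requirement, str], bool],
    inject_smell: Callable[[str], Optional[Tuple[str, str]]],
) -> List[Requirement]:
    """Generate requirements per file, record the reviewer verdict, then inject smells."""
    dataset: List[Requirement] = []
    for path, code in code_chunks.items():
        for text in generate(code):
            req = Requirement(text=text, source_file=path)
            req.implemented = review(req, code)
            smelly = inject_smell(req.text)
            if smelly is not None:
                req.text, req.injected_smell = smelly
            dataset.append(req)
    return dataset


# Deterministic stubs (assumptions) so the sketch runs without an LLM or a reviewer.
def fake_generate(code: str) -> List[str]:
    if "def add" in code:
        return ["The system shall compute the sum of two integers."]
    return []


def fake_review(req: Requirement, code: str) -> bool:
    return "sum" in req.text and "def add" in code


def fake_inject(text: str) -> Optional[Tuple[str, str]]:
    return text.replace("shall", "shall quickly"), "ambiguous adverb"


demo = run_pipeline(
    {"calc.py": "def add(a, b):\n    return a + b\n"},
    fake_generate,
    fake_review,
    fake_inject,
)
```

Keeping the injected smell name on the record is a design choice of this sketch: it gives later detection steps a ground truth to be scored against.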

If this is right

Code-to-requirements datasets cannot be created reliably with current LLMs without extensive human supervision for quality assurance.
Empirical research in requirements engineering will continue to lack adequate paired datasets under current LLM methods.
Requirements smells cannot be introduced or detected reliably by the tested LLM or by one human reviewer.
Generating non-implemented requirements from existing code remains outside the reliable capability of LLMs.
High-quality requirements generation from code is not achievable with the evaluated LLM and RAG setup.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Multiple human reviewers working together might improve smell detection where a single reviewer fails.
Similar reliability problems could appear in other tasks that ask LLMs to reverse-engineer specifications from code.
Researchers may need to invest more effort in collecting authentic paired datasets rather than attempting synthetic generation.
Tool builders for requirements-from-code tasks should plan for validation steps beyond single LLM or single-reviewer checks.
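
Moving from one reviewer to several presupposes a way to measure how far reviewers agree. One common chance-corrected measure for two raters is Cohen's kappa, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below is a minimal illustration under stated assumptions: the summary above does not say which agreement statistic the paper used, and the labels are invented.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented labels for six requirements (not data from the paper).
reviewer_1 = ["smell", "clean", "smell", "clean", "smell", "clean"]
reviewer_2 = ["smell", "clean", "clean", "clean", "smell", "smell"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
```

For these invented labels the raters agree on four of six items (p_o = 2/3) while chance agreement is 1/2, so kappa is 1/3, well below the raw agreement rate.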

Load-bearing premise

The failures seen with the tested LLM, RAG setup, and single human reviewer indicate fundamental limits that hold beyond the specific code bases and choices used in the experiments.

What would settle it

A replication that uses a different LLM or multiple human reviewers and succeeds at producing high-quality requirements, non-implemented requirements, and controlled synthetic smells would show the approach can be viable.

Figures

Figures reproduced from arXiv: 2606.25550 by Alexander Korn, Andreas Vogelsang, Jone Bartel, Max Unterbusch.

**Figure 2.** Agreement of human evaluators (after peer-review) with LLM on the …

**Figure 3.** Comparison of human-in-the-loop appropriateness ratings over time. Each batch is generated with reviews from last batch as shot samples.

Original abstract

Empirical research in Requirements Engineering is hampered by a lack of adequate datasets that pair source code with corresponding requirements. A tempting route to addressing this lack is the use of Large Language Models to synthesize requirements from existing code bases. We investigate this question by evaluating an LLM-based and RAG-supported agentic approach that generates requirements from source code, verifies their implementation status relying on a human-in-the-loop, and synthetically introduces requirements smells and non-implemented requirements. Our goal was to create datasets that mimic reality and foster empirical RE research. However, during the study, various problems arose, leading to this experience report. Contrary to our initial hypotheses, LLMs were unable to (i) generate non-implemented requirements reliably, (ii) generate high quality requirements, and (iii) reliably introduce synthetic requirements smells. Furthermore, neither an LLM nor a single human-in-the-loop suffices to detect requirements smells reliably. These findings suggest that the generation of code-to-requirements datasets using LLMs is not yet viable and requires human supervision, especially for quality assurance. We critically reflect on our lessons learned and draw relevant conclusions for both researchers and practitioners.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

This experience report documents concrete failures in one LLM+RAG workflow for code-to-requirements generation, but the viability conclusion rests on limited evidence from a single unablated setup.

The main thing to know is that the authors ran an agentic LLM setup with RAG to generate requirements from code, including non-implemented ones and synthetic smells, and it fell short on reliability, quality, and smell injection. Detection by the model or a single human reviewer also proved unreliable, leading them to say this route to dataset creation is not yet viable without heavy human oversight.

The paper does a solid job reporting what actually happened in their workflow. It lays out the specific problems they encountered and reflects on lessons for both researchers and practitioners. That kind of direct account is useful in requirements engineering, where people often need to know the practical barriers before investing time in similar experiments.

The soft spot is the jump from these observations to the broader claim. The results come from their particular models, prompts, code bases, and review process. Without ablations or tests on other configurations, it is hard to tell how much is specific to this setup versus a general limit of current LLMs. The stress-test note correctly flags that the generalization step is not strongly supported.

This paper is for requirements engineering researchers who are thinking about using LLMs to build paired code-requirements datasets. A reader in that niche will get a clear picture of the issues that arose. The work shows straightforward engagement with the task and the relevant literature.

I would send it for peer review. It is a straightforward experience report with observable findings that could save others time, even if the scope stays narrow.

Referee Report

2 major / 1 minor

Summary. The paper is an experience report describing an attempt to use an LLM-based, RAG-supported agentic workflow to generate requirements from source code, verify implementation status via human-in-the-loop, and synthetically introduce requirements smells and non-implemented requirements. Contrary to the authors' initial hypotheses, the tested approach failed to reliably produce non-implemented requirements, high-quality requirements, or synthetic smells, and neither the LLM nor a single human reviewer could reliably detect smells. The authors conclude that LLM-driven generation of code-to-requirements datasets is not yet viable and requires substantial human supervision for quality assurance.

Significance. If the reported failures are representative, the work provides a cautionary data point for the requirements engineering community on the current limitations of LLMs for synthetic dataset creation, underscoring the need for hybrid human-AI processes and potentially discouraging over-optimistic automation in empirical RE research.

major comments (2)

[Abstract, §1] The central claim that LLM-based requirements generation 'is not yet viable' generalizes from observations on a single LLM, RAG configuration, and set of code bases; no ablation studies, alternative models, or prompt variations are reported to support moving from 'this workflow failed' to a broader viability assessment.
[Abstract] The viability conclusion and the three enumerated failures are stated without any quantitative metrics, failure rates, or controlled baselines, leaving the severity of the problems and the strength of the negative result difficult to calibrate.
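
One way to supply the calibration the second comment asks for: if the identifiers of requirements with injected smells are logged, a detector (LLM or human) can be scored with precision and recall against that ground truth. The sketch below is only an illustration of that bookkeeping; `detection_scores` is a hypothetical helper and the requirement IDs are invented, not results from the paper.

```python
from typing import Set, Tuple


def detection_scores(injected: Set[str], flagged: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) of flagged requirement IDs against injected ones."""
    true_positives = len(injected & flagged)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(injected) if injected else 0.0
    return precision, recall


# Invented example: four requirements carry an injected smell, the detector flags three.
injected_ids = {"R1", "R3", "R4", "R7"}
flagged_ids = {"R1", "R2", "R4"}
precision, recall = detection_scores(injected_ids, flagged_ids)
```

In this invented case two of the three flags are correct (precision 2/3) and two of the four injected smells are found (recall 1/2). Reporting both numbers per smell type would let readers judge how severe the detection problem is.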

minor comments (1)

The manuscript would benefit from explicit discussion of the specific LLM and RAG implementation details (e.g., model version, retrieval parameters) to allow readers to assess reproducibility of the reported workflow.
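
A lightweight way to report such details is to store them as a machine-readable record next to the generated data. The sketch below assumes a hypothetical `RunConfig` record; every field name and value is a placeholder chosen for illustration and says nothing about the configuration the authors actually used.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    model_name: str
    model_version: str
    temperature: float
    retriever: str
    top_k: int               # number of retrieved chunks passed to the model
    chunk_size_tokens: int   # size of each indexed code chunk
    prompt_template_id: str


# Placeholder values only (assumptions, not the paper's settings).
config = RunConfig(
    model_name="example-llm",
    model_version="2026-01-01",
    temperature=0.0,
    retriever="example-embedding-index",
    top_k=5,
    chunk_size_tokens=512,
    prompt_template_id="req-gen-v1",
)

# Sorted keys keep the serialized form stable across runs, which helps diffing.
serialized = json.dumps(asdict(config), indent=2, sort_keys=True)
restored = RunConfig(**json.loads(serialized))
```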

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments, which help us better position this experience report. We address each major comment below and indicate planned revisions.

Referee: [Abstract, §1] The central claim that LLM-based requirements generation 'is not yet viable' generalizes from observations on a single LLM, RAG configuration, and set of code bases; no ablation studies, alternative models, or prompt variations are reported to support moving from 'this workflow failed' to a broader viability assessment.

Authors: We agree that the study examines only one specific LLM-based, RAG-supported agentic workflow on a limited set of code bases. As an experience report, the manuscript documents the concrete difficulties encountered rather than claiming that every possible LLM configuration must fail. We will revise the abstract and §1 to explicitly bound the conclusions to the tested workflow and to present the results as lessons learned from this particular attempt, thereby reducing any implication of broader generalization. Additional ablation studies or alternative-model experiments lie outside the scope of an experience report and are not planned.

revision: partial

Referee: [Abstract] The viability conclusion and the three enumerated failures are stated without any quantitative metrics, failure rates, or controlled baselines, leaving the severity of the problems and the strength of the negative result difficult to calibrate.

Authors: The paper is an experience report that records qualitative observations of repeated failures (inability to generate non-implemented requirements, high-quality specifications, and reliable synthetic smells). We will expand the abstract to include brief descriptive statements on the consistency and pervasiveness of these issues across the code bases examined, thereby providing readers with a clearer sense of scale without converting the work into a quantitative study.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical observations from direct experiments

The paper is an experience report whose central claims (LLM inability to generate non-implemented requirements, high-quality output, or synthetic smells; insufficiency of LLM or single human reviewer for smell detection) are stated as direct outcomes of the authors' own experimental runs on chosen code bases. No equations, fitted parameters, or self-citations are used to derive these results; the viability conclusion follows from the reported failures rather than from any self-referential definition or prior-author uniqueness theorem. The generalization concern raised by the skeptic is a question of external validity, not a circular reduction of the reported findings to their inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on standard empirical assumptions in software engineering without introducing fitted parameters or new entities.

axioms (1)

domain assumption: The chosen code bases, LLM models, and single human reviewer are representative for assessing the general viability of LLM-based requirements generation.
The viability conclusion is extrapolated from the specific instances tested during the study.

