arxiv: 2602.01253 · v1 · pith:XDMA35LSnew · submitted 2026-02-01 · 💻 cs.SE

TraceLLM: Leveraging Large Language Models with Prompt Engineering for Enhanced Requirements Traceability

Pith reviewed 2026-05-25 06:56 UTC · model grok-4.3

classification 💻 cs.SE
keywords: requirements traceability · prompt engineering · large language models · demonstration selection · software engineering · trace links · few-shot prompting · benchmark evaluation

The pith

Systematic prompt engineering with LLMs produces state-of-the-art requirements traceability links on four benchmark datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TraceLLM as a framework that applies iterative prompt refinement, contextual role enrichment, domain knowledge injection, and label-aware diversity sampling to guide large language models in recovering trace links between requirements and other artifacts. It evaluates the approach in zero-shot and few-shot regimes using eight different LLMs on four datasets drawn from the aerospace and healthcare domains, covering requirements, design elements, test cases, and regulations. The central finding is that these prompt choices raise F2 scores above those of information-retrieval baselines, fine-tuned models, and earlier LLM methods. A sympathetic reader would care because the work shows that traceability performance hinges on prompt quality as well as on model capacity, opening a path to less manual and more reliable link maintenance across the software lifecycle.

Core claim

TraceLLM is a systematic framework that combines rigorous dataset splitting, iterative prompt refinement, enrichment with contextual roles and domain knowledge, and evaluation across zero- and few-shot settings. When paired with label-aware diversity-based demonstration selection, the framework produces state-of-the-art F2 scores on eight LLMs across four benchmark datasets, outperforming traditional IR baselines, fine-tuned models, and prior LLM-based methods. The results indicate that traceability performance depends on both model capacity and the quality of prompt engineering, and that the achieved scores support semi-automated workflows in which humans review candidate links.
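To make the ingredients named above concrete, the following is a minimal sketch of how a prompt with a contextual role, domain knowledge, and optional demonstrations might be assembled. The function name `build_prompt`, the template wording, and the yes/no answer format are illustrative assumptions; the paper's actual prompts are not reproduced on this page.

```python
def build_prompt(role, domain_notes, demonstrations, source, target):
    """Assemble a zero-shot (no demonstrations) or few-shot prompt.

    All wording here is hypothetical, not the TraceLLM template.
    demonstrations: list of (source_text, target_text, label) triples.
    """
    parts = [f"You are {role}."]
    if domain_notes:
        notes = "\n".join(f"- {note}" for note in domain_notes)
        parts.append("Domain knowledge:\n" + notes)
    for demo_source, demo_target, label in demonstrations:
        parts.append(
            f"Source: {demo_source}\nTarget: {demo_target}\nAnswer: {label}"
        )
    # The query pair always comes last, with the answer left open.
    parts.append(f"Source: {source}\nTarget: {target}\nAnswer (yes or no):")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_prompt(
        role="a requirements engineer reviewing trace links",
        domain_notes=["Regulations may be satisfied by several requirements."],
        demonstrations=[("Encrypt records at rest.", "Storage uses AES.", "yes")],
        source="Audit every login attempt.",
        target="The UI supports dark mode.",
    ))
```

Passing an empty `demonstrations` list yields the zero-shot variant, so the same function covers both regimes the paper evaluates.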

What carries the argument

The TraceLLM framework carries the argument: it combines iterative prompt refinement with label-aware, diversity-based demonstration selection to steer LLMs toward accurate trace-link generation.
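One plausible reading of label-aware, diversity-based selection is sketched below: group the candidate demonstrations by label, then pick examples within each label by a greedy farthest-point rule. This is an assumption for illustration only. The page does not describe the paper's actual algorithm, and the word-set Jaccard distance here is a stand-in for whatever similarity measure the authors use.

```python
from collections import defaultdict


def tokens(text):
    """Lowercased word set; a crude stand-in for a sentence embedding."""
    return set(text.lower().split())


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|, defined as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def select_demonstrations(pool, k_per_label):
    """Pick up to k_per_label diverse examples for every label.

    pool: list of (text, label) pairs, assumed to come from training data only.
    """
    by_label = defaultdict(list)
    for text, label in pool:
        by_label[label].append(text)

    chosen = []
    for label in sorted(by_label):
        # Drop duplicate texts while keeping first-seen order.
        candidates = list(dict.fromkeys(by_label[label]))
        picked = [candidates[0]]  # seed with the first example of this label
        while len(picked) < min(k_per_label, len(candidates)):
            # Farthest-point step: take the candidate whose nearest
            # already-picked neighbor is farthest away.
            best = max(
                (c for c in candidates if c not in picked),
                key=lambda c: min(
                    jaccard_distance(tokens(c), tokens(p)) for p in picked
                ),
            )
            picked.append(best)
        chosen.extend((text, label) for text in picked)
    return chosen
```

Grouping by label first guarantees that both linked and unlinked pairs appear among the demonstrations even when the pool is heavily imbalanced, which is the "label-aware" part; the farthest-point step supplies the diversity.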

If this is right

  • Traceability performance is shown to depend on prompt engineering quality in addition to model capacity.
  • Label-aware diversity sampling emerges as an effective strategy for choosing demonstrations.
  • The method supports semi-automated workflows in which analysts review and validate candidate links.
  • Performance gains hold across zero-shot and few-shot regimes and across diverse artifact types.
  • The approach generalizes within the tested aerospace and healthcare domains and artifact categories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prompt patterns could be tested on other software engineering tasks such as defect localization or change impact analysis.
  • Releasing the refined prompts would let practitioners replicate or extend the results on their own datasets.
  • Smaller or open-source LLMs might reach usable accuracy with the same prompt discipline, lowering compute requirements.
  • Combining the prompt framework with lightweight fine-tuning on project-specific data could further raise precision.

Load-bearing premise

The prompt refinement and demonstration selection strategies developed on these four datasets will transfer to new domains and artifact types without substantial re-tuning.

What would settle it

Apply the published TraceLLM prompts without modification to a fresh dataset from an untested domain such as automotive control software and measure whether the resulting F2 scores fall below the reported state-of-the-art values.

Original abstract

Requirements traceability, the process of establishing and maintaining relationships between requirements and various software development artifacts, is paramount for ensuring system integrity and fulfilling requirements throughout the Software Development Life Cycle (SDLC). Traditional methods, including manual and information retrieval models, are labor-intensive, error-prone, and limited by low precision. Recently, Large Language Models (LLMs) have demonstrated potential for supporting software engineering tasks through advanced language comprehension. However, a substantial gap exists in the systematic design and evaluation of prompts tailored to extract accurate trace links. This paper introduces TraceLLM, a systematic framework for enhancing requirements traceability through prompt engineering and demonstration selection. Our approach incorporates rigorous dataset splitting, iterative prompt refinement, enrichment with contextual roles and domain knowledge, and evaluation across zero- and few-shot settings. We assess prompt generalization and robustness using eight state-of-the-art LLMs on four benchmark datasets representing diverse domains (aerospace, healthcare) and artifact types (requirements, design elements, test cases, regulations). TraceLLM achieves state-of-the-art F2 scores, outperforming traditional IR baselines, fine-tuned models, and prior LLM-based methods. We also explore the impact of demonstration selection strategies, identifying label-aware, diversity-based sampling as particularly effective. Overall, our findings highlight that traceability performance depends not only on model capacity but also critically on the quality of prompt engineering. In addition, the achieved performance suggests that TraceLLM can support semi-automated traceability workflows in which candidate links are reviewed and validated by human analysts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TraceLLM, a framework that applies prompt engineering (including contextual roles, domain knowledge enrichment, iterative refinement, and label-aware diversity-based demonstration selection) to LLMs for requirements traceability. It evaluates the approach in zero- and few-shot settings across eight LLMs and four benchmark datasets spanning the aerospace and healthcare domains with varying artifact types, claiming state-of-the-art F2 scores that outperform traditional IR baselines, fine-tuned models, and prior LLM-based methods.

Significance. If the reported performance reflects genuine generalization without evaluation bias, the work would be significant for software engineering by showing that systematic prompt design can deliver high traceability performance without model fine-tuning or large labeled datasets, supporting semi-automated workflows. The identification of effective demonstration selection strategies provides actionable insight into prompt robustness.

major comments (2)
  1. [Abstract/Methods] Abstract and Methods sections: The description of 'rigorous dataset splitting' followed by 'iterative prompt refinement' does not explicitly confirm that all refinement steps (including any performance-based adjustments) were restricted to a held-out validation partition with prompts frozen before test-set evaluation. This detail is load-bearing for the central SOTA F2 claim, as access to test labels or examples during refinement would introduce optimistic bias and invalidate comparisons to baselines.
  2. [Evaluation] Evaluation section: The abstract reports SOTA F2 scores but provides no numerical values, confidence intervals, statistical significance tests (e.g., McNemar or paired t-tests), or per-dataset split ratios. Without these, the outperformance claims over IR baselines and fine-tuned models cannot be assessed for robustness or practical effect size.
minor comments (2)
  1. [Abstract] Abstract: The choice of F2 (beta=2) over F1 is not motivated; a brief justification for emphasizing recall in traceability would clarify the metric selection.
  2. [Methods] The paper mentions 'parameter-free' aspects of the approach in places but does not clarify whether demonstration selection involves any tunable hyperparameters that could affect reproducibility.
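For readers unfamiliar with the metric raised in the first minor comment, the F-beta score is (1 + β²)·P·R / (β²·P + R), so β = 2 weights recall more heavily than precision. The short sketch below computes it for two hypothetical precision/recall pairs; the numbers are made up for illustration and are not results from the paper. The usual argument for recall emphasis in traceability, offered here as background rather than as the authors' stated rationale, is that a missed link is harder for an analyst to recover than a spurious candidate is to reject.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Two hypothetical systems with swapped precision and recall.
    for p, r in [(0.5, 0.8), (0.8, 0.5)]:
        print(f"P={p} R={r} F1={f_beta(p, r, 1.0):.3f} F2={f_beta(p, r, 2.0):.3f}")
```

Both hypothetical systems share the same F1 (about 0.615), but the high-recall one scores about 0.714 on F2 against about 0.541 for the high-precision one.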

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight critical aspects of methodological transparency that we will address to strengthen the paper.

Point-by-point responses
  1. Referee: [Abstract/Methods] Abstract and Methods sections: The description of 'rigorous dataset splitting' followed by 'iterative prompt refinement' does not explicitly confirm that all refinement steps (including any performance-based adjustments) were restricted to a held-out validation partition with prompts frozen before test-set evaluation. This detail is load-bearing for the central SOTA F2 claim, as access to test labels or examples during refinement would introduce optimistic bias and invalidate comparisons to baselines.

    Authors: We confirm that all iterative prompt refinement, including any performance-based adjustments, was performed exclusively on a held-out validation partition. Prompts were frozen prior to any test-set evaluation, ensuring no access to test labels or examples. We will revise the Methods section to explicitly document the split ratios, the validation-only refinement protocol, and confirmation that test data remained untouched until final evaluation. revision: yes

  2. Referee: [Evaluation] Evaluation section: The abstract reports SOTA F2 scores but provides no numerical values, confidence intervals, statistical significance tests (e.g., McNemar or paired t-tests), or per-dataset split ratios. Without these, the outperformance claims over IR baselines and fine-tuned models cannot be assessed for robustness or practical effect size.

    Authors: The Evaluation section already reports the full numerical F2 scores per dataset, confidence intervals, McNemar's test results for statistical significance, and the exact train/validation/test split ratios. To make these immediately visible without requiring readers to reach the body, we will revise the abstract to include representative F2 values, note the use of statistical tests, and reference the split ratios. revision: yes
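As background on the significance test named in this exchange, the sketch below shows an exact two-sided McNemar test on paired predictions from two methods over the same candidate links. It is a generic illustration using only the standard library; the function name and the toy data are assumptions, and nothing here reproduces the paper's analysis.

```python
from math import comb


def mcnemar_exact(gold, pred_a, pred_b):
    """Exact two-sided McNemar test on paired binary predictions.

    Returns (b, c, p_value), where b counts items only method A gets
    right and c counts items only method B gets right.
    """
    b = sum(1 for g, a, p in zip(gold, pred_a, pred_b) if a == g and p != g)
    c = sum(1 for g, a, p in zip(gold, pred_a, pred_b) if a != g and p == g)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the methods never disagree on correctness
    # Under the null, the discordant pairs follow Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


if __name__ == "__main__":
    gold = [1] * 10
    method_a = [1] * 10                        # correct on all ten pairs
    method_b = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # correct on two pairs
    print(mcnemar_exact(gold, method_a, method_b))
```

Only the discordant pairs enter the test, which is why it suits comparisons where two methods are evaluated on the identical test split.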

Circularity Check

0 steps flagged

No circularity: empirical evaluation with no fitted predictions or self-referential derivations

full rationale

The paper presents an empirical framework for prompt engineering in requirements traceability, reporting F2 scores from evaluations across eight LLMs and four datasets. No equations, parameters fitted to subsets then renamed as predictions, or self-citation chains supporting uniqueness theorems appear in the provided text. The central claims rest on direct experimental comparisons rather than any derivation that reduces to author-defined inputs by construction. Iterative prompt refinement is described as part of the method but does not trigger any of the enumerated circularity patterns (self-definitional, fitted-input-called-prediction, etc.). This is a standard empirical SE paper whose results are externally falsifiable via replication on the same benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard assumptions about LLM prompt sensitivity and the representativeness of the four benchmark datasets; no free parameters, invented entities, or ad-hoc axioms are visible from the abstract.

axioms (1)
  • Domain assumption: Large language models can reliably follow enriched prompts that include domain roles and contextual knowledge for traceability tasks.
    Invoked implicitly in the description of prompt enrichment and evaluation across zero- and few-shot settings.

pith-pipeline@v0.9.0 · 5810 in / 1247 out tokens · 30051 ms · 2026-05-25T06:56:14.674351+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Neuro-Symbolic Agents for Hallucination-Free Requirements Reuse

    cs.SE · 2026-05 · unverdicted · novelty 6.0

    A neuro-symbolic agent system for requirements reuse achieves 100% coverage and 0.2% constraint violations by construction through symbolic enforcement of an OOMRAM lattice.
