
arxiv: 2607.00427 · v1 · pith:NBHZF37Tnew · submitted 2026-07-01 · 💻 cs.SE

BT-APE: A Computationally Light Backtracking Approach to Automatic Prompt Engineering for Requirements Classification

Pith reviewed 2026-07-02 09:10 UTC · model grok-4.3

classification 💻 cs.SE
keywords automatic prompt engineering · requirements classification · backtracking search · large language models · prompt optimization · software requirements engineering · computational efficiency

The pith

BT-APE produces prompts for requirements classification that reach the accuracy of heavy automatic methods while using 72 percent fewer input tokens and 66 percent less time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames prompt design for large language models as an optimization problem and introduces BT-APE, which generates candidate prompts, applies backtracking search to explore them, and uses dynamic example selection to refine choices. On three benchmark datasets and five instruction-tuned models, this approach reaches accuracy levels nearly identical to the more expensive PE2 baseline and well above classical methods such as zero-shot, few-shot, and chain-of-thought prompting. The decisive difference is the reduced computational load, which the authors quantify through token counts and wall-clock measurements. If the claim holds, automated prompt construction becomes practical in settings where repeated calls to large models carry high cost or latency constraints. The work also supplies an interactive tool and replication package to support further use.

Core claim

BT-APE iteratively refines prompts for requirements classification by having an LLM propose candidates, then applying backtracking search and dynamic example selection to identify high-performing prompts. Across three datasets and five LLMs it matches the accuracy of the resource-intensive PE2 method while consuming roughly 72 percent fewer input tokens and 66 percent less wall-clock time, and it substantially outperforms four classical prompting baselines with large effect sizes.
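The savings figures are relative reductions against the PE2 baseline. As a minimal sketch of that arithmetic, the helper below computes the fraction saved. The function name `relative_reduction` and the token counts are hypothetical, chosen only so that the result matches the reported 72 percent; they are not measurements from the paper.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Fraction of the baseline cost that the proposed method saves."""
    if baseline <= 0:
        raise ValueError("baseline cost must be positive")
    return (baseline - proposed) / baseline


# Hypothetical input-token counts (illustrative only, not from the paper).
pe2_tokens = 1_000_000
bt_ape_tokens = 280_000

saving = relative_reduction(pe2_tokens, bt_ape_tokens)
print(f"input-token saving: {saving:.0%}")
```

The same helper applies to wall-clock time: a run that takes 102 seconds against a 300-second baseline corresponds to a 66 percent reduction.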

What carries the argument

Backtracking search over LLM-generated prompt candidates combined with dynamic example selection.
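The loop below is a minimal sketch of how such a search could be organised. It is not the paper's Algorithm 1: the function name, the `patience` rule for triggering a backtrack, and the three callbacks (`propose`, `score`, `pick_examples`) are all assumptions made for illustration. In a real run, `propose` would call an LLM and `score` would evaluate a prompt on a validation split; here they are deterministic stubs.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (requirement text, gold label); hypothetical shape


def backtracking_prompt_search(
    initial_prompt: str,
    propose: Callable[[str, Sequence[Example]], str],
    score: Callable[[str], float],
    pick_examples: Callable[[str], List[Example]],
    max_iters: int = 10,
    patience: int = 2,
) -> Tuple[str, float, List[str]]:
    """Greedy refinement that returns to the best prompt after repeated misses."""
    best_prompt, best_score = initial_prompt, score(initial_prompt)
    current = best_prompt
    misses = 0
    log: List[str] = []
    for _ in range(max_iters):
        # Dynamic example selection: examples are re-chosen for the current prompt.
        examples = pick_examples(current)
        candidate = propose(current, examples)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best_prompt, best_score = candidate, candidate_score
            current, misses = candidate, 0
            log.append("improve")
        else:
            misses += 1
            if misses >= patience:
                # Backtrack: discard the unproductive branch.
                current, misses = best_prompt, 0
                log.append("backtrack")
            else:
                current = candidate
                log.append("explore")
    return best_prompt, best_score, log


def demo() -> Tuple[str, float, List[str]]:
    """Toy run with stub callbacks; no LLM is involved."""
    good = {"security", "usability"}
    edits = iter(["security", "noise", "noise", "usability"])

    def propose(prompt: str, examples: Sequence[Example]) -> str:
        return prompt + " " + next(edits, "noise")

    def score(prompt: str) -> float:
        return len(good & set(prompt.split())) / len(good)

    def pick_examples(prompt: str) -> List[Example]:
        return []

    return backtracking_prompt_search(
        "classify", propose, score, pick_examples, max_iters=4, patience=2
    )
```

In the toy run, two unproductive edits in a row trigger a return to the best prompt found so far, after which the next proposal improves on it. The design choice worth noting is that backtracking bounds the number of evaluations spent on a branch that is not improving.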

If this is right

  • Requirements classification tasks can shift from manual trial-and-error prompt writing to an automated search process.
  • Automated prompt engineering becomes viable under tighter compute budgets than those required by prior heavy methods.
  • Performance advantages over zero-shot, few-shot, chain-of-thought, and combined baselines hold across multiple datasets and models.
  • Class definitions supplied in the initial prompt influence how the search evolves later candidates.
  • An open tool and replication package make the method immediately usable for other requirements engineering classification problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same backtracking structure could be tested on related requirements engineering tasks such as traceability link recovery or ambiguity detection.
  • Lower token consumption might allow the optimization loop to run repeatedly inside interactive development environments without noticeable delay.
  • The observed interaction between class definitions and prompt evolution supplies a concrete starting point for studying how initial human-written constraints shape automated search outcomes.

Load-bearing premise

The combination of backtracking search and dynamic example selection will locate prompts that continue to perform well on datasets and models outside the three benchmarks and five LLMs examined.

What would settle it

Running BT-APE on a fourth independent requirements classification dataset or with a sixth LLM and finding that its accuracy falls materially below PE2 or fails to exceed the classical baselines.

Figures

Figures reproduced from arXiv: 2607.00427 by Alessio Ferrari, Jacek Dąbrowski, Liping Zhao, Mohammad Amin Zadenoori, Waad Alhoshan.

Figure 1: Overview of the BT-APE workflow applied to requirements classification, alongside Algorithm 1. Steps …
Figure 2: Average F1 of the five instruction-tuned LLMs across the six prompting strategies on the three …
Figure 3: Data & Config. The dataset is uploaded as a CSV (text first, label last); the tool infers the classes and their distribution, selects the backend and model, and exposes the run hyper-parameters as sliders.
Figure 4: Prompt. The prompt is separated into a fixed task-framing part and an optimizable definitions part (highlighted); BT-APE rewrites only the latter.
Figure 5: BT-APE Optimisation. Interactive or automatic execution of Algorithm 1. The iteration log and colored history strip (green = improvement, amber = backtrack), the live best prompt and its evolving optimizable section, and the per-class metrics are shown together, exposing the optimization trajectory.
Figure 6: Baselines and Results. (a) Each prompting strategy is run and scored per class. (b) The session summary contrasts the baselines with the BT-APE-optimized prompt; the optimized prompt yields the highest macro F1 and improves the weakest classes.
Figure 7: Jaccard similarity scores across different models and datasets. Higher values would indicate greater …
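Figure 7 reports Jaccard similarity scores. For two sets A and B the coefficient is J(A, B) = |A ∩ B| / |A ∪ B|, ranging from 0 (disjoint) to 1 (identical). The sketch below computes it over whitespace-separated tokens; the choice of tokens as the compared items, the function name, and the convention for two empty sets are assumptions, since this page does not state what the paper compares.

```python
from typing import Iterable


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections treated as sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # Assumed convention: two empty sets count as identical.
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical requirement sentences, used only to exercise the function.
first = "the system shall encrypt data".split()
second = "the system shall log data".split()
similarity = jaccard(first, second)  # 4 shared tokens out of 6 distinct tokens
```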
Original abstract

Large language models (LLMs) are increasingly applied to requirements engineering (RE) tasks, yet the prompts guiding them are typically designed manually through trial and error, yielding inconsistent and suboptimal results. Automated prompt construction remains largely unexplored in RE, leaving its effectiveness unclear. To address this, we propose a lightweight Automatic Prompt Engineering approach, Backtracking APE (BT-APE), and apply it to requirements classification. We frame prompt design as an optimization problem, iteratively refining prompts via LLM-generated candidates, backtracking search, and dynamic example selection. Evaluating BT-APE on three benchmark datasets with five instruction-tuned LLMs, we compare it against four classical prompting baselines (zero-shot, few-shot, chain-of-thought, CoT+few-shot) and a state-of-the-art but resource-intensive APE baseline (PE2). BT-APE and PE2 achieve nearly identical accuracy, both substantially outperforming the classical baselines with large effect sizes; however, BT-APE imposes a far lighter computational footprint, consuming roughly 72% fewer input tokens and 66% less wall-clock time at equivalent accuracy, making it better suited to resource-constrained deployment. Our contributions are threefold: (i) a lightweight APE framework with an open interactive tool and replication package; (ii) the first systematic comparison of APE against classical prompting for requirements classification; and (iii) insights into how class definitions and prompt evolution affect performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes BT-APE, a lightweight backtracking automatic prompt engineering method for requirements classification. It models prompt design as an iterative optimization using LLM-generated candidates, backtracking search, and dynamic example selection. On three benchmark datasets and five instruction-tuned LLMs, BT-APE matches the accuracy of the resource-intensive PE2 baseline while using ~72% fewer input tokens and ~66% less wall-clock time, and substantially outperforms classical baselines (zero-shot, few-shot, CoT, CoT+few-shot) with large effect sizes. Contributions include an open tool and replication package, the first systematic APE vs. classical prompting comparison in RE, and insights on class definitions and prompt evolution.

Significance. If the efficiency claims hold under broader testing, the work offers a practical advance for requirements engineering by enabling effective prompt optimization in resource-constrained environments, where full APE methods like PE2 are impractical. The open replication package and tool are positive for reproducibility.

major comments (2)
  1. [§4 Evaluation and Table 3] The reported accuracy parity with PE2 and large effect sizes over baselines are presented without statistical tests, variance across LLM stochastic runs, or controls for prompt variability, leaving the robustness of the central performance claims unverified.
  2. [§5 Discussion and §6 Conclusion] The claim that BT-APE is 'better suited to resource-constrained deployment' rests entirely on results from three specific datasets and five LLMs; no cross-dataset or cross-LLM generalization experiments are reported, which is load-bearing for separating the efficiency advantage from potential overfitting to the evaluation scope.
minor comments (2)
  1. [§3 Method] The abstract and method description should clarify the exact backtracking termination criteria and dynamic example selection heuristic with pseudocode or equations for reproducibility.
  2. [Figure 2] Figure 2 (prompt evolution example) would benefit from explicit annotation of backtracking steps and token counts to directly illustrate the claimed efficiency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and note the planned revisions.

  1. Referee: [§4 Evaluation and Table 3] The reported accuracy parity with PE2 and large effect sizes over baselines are presented without statistical tests, variance across LLM stochastic runs, or controls for prompt variability, leaving the robustness of the central performance claims unverified.

    Authors: We agree that the absence of statistical tests and variance reporting weakens the robustness claims. In the revised manuscript we will rerun all experiments with five random seeds per configuration, report means and standard deviations, and apply paired statistical tests (t-tests or Wilcoxon signed-rank) to the accuracy differences. These additions will be incorporated into §4 and Table 3.
    Revision: yes

  2. Referee: [§5 Discussion and §6 Conclusion] The claim that BT-APE is 'better suited to resource-constrained deployment' rests entirely on results from three specific datasets and five LLMs; no cross-dataset or cross-LLM generalization experiments are reported, which is load-bearing for separating the efficiency advantage from potential overfitting to the evaluation scope.

    Authors: The current evaluation already spans three datasets and five LLMs with consistent efficiency gains. We acknowledge that additional cross-validation experiments would provide stronger separation from overfitting. In revision we will qualify the deployment claim in §5 and §6, explicitly state the evaluation scope as a limitation, and temper language to avoid implying broad generalization beyond the tested setting.
    Revision: partial
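The reporting promised in response 1 (means, standard deviations, and a paired test over five seeds) can be sketched with the standard library alone. The helper names and the per-seed scores below are hypothetical and do not come from the paper. The code stops at the paired t statistic; a p-value would require the t distribution with n − 1 degrees of freedom, which `scipy.stats.ttest_rel` provides, and `scipy.stats.wilcoxon` covers the non-parametric alternative.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds."""
    return mean(scores), stdev(scores)


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Paired t statistic for matched runs (same seed, same configuration)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(diffs) / (spread / sqrt(len(diffs)))


# Hypothetical per-seed macro F1 values for two methods (illustrative only).
method_a = [0.81, 0.83, 0.80, 0.82, 0.84]
method_b = [0.80, 0.83, 0.79, 0.83, 0.82]

mean_a, sd_a = summarize(method_a)
mean_b, sd_b = summarize(method_b)
t_value = paired_t_statistic(method_a, method_b)
```

Pairing by seed matters here: it removes seed-to-seed variation that both methods share, so a small but consistent difference is easier to detect than with an unpaired comparison.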

Circularity Check

0 steps flagged

No circularity: independent search procedure evaluated on external benchmarks

The paper presents BT-APE as a backtracking search algorithm with dynamic example selection for prompt optimization, evaluated directly against classical baselines and PE2 on three fixed benchmark datasets and five LLMs. No equations, fitted parameters, or self-referential derivations appear; performance claims rest on empirical comparisons rather than any reduction of outputs to inputs by construction. No self-citations are load-bearing for the central method or results. This is a standard empirical methods paper with no detectable circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the approach rests on standard search algorithms and existing LLM capabilities.

pith-pipeline@v0.9.1-grok · 5804 in / 1089 out tokens · 34019 ms · 2026-07-02T09:10:47.501523+00:00 · methodology


Reference graph

Works this paper leans on

60 extracted references · 33 canonical work pages · 4 internal anchors

  1. [1]

    2018. ISO/IEC/IEEE International Standard - Systems and software engineering – Life cycle processes – Requirements engineering. ISO/IEC/IEEE 29148:2018(E) (2018), 1–104. doi:10.1109/IEEESTD.2018.8559686

  2. [2]

    Waleed Abdeen, Michael Unterkalmsteiner, Krzysztof Wnuk, Alessio Ferrari, and Panagiota Chatzipetrou. 2025. Language models to support multi-label classification of industrial data. In 2025 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 45–55

  3. [3]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  4. [4]

    Waad Alhoshan, Alessio Ferrari, and Liping Zhao. 2023. Zero-shot learning for requirements classification: An exploratory study. Information and Software Technology 159 (2023), 107202

  5. [5]

    Tawfeeq Alsanoosy. 2025. Large Language Model for Requirements Classification: An Ensemble Approach. Procedia Computer Science 270 (2025), 3648–3657. doi:10.1016/j.procs.2025.09.490 29th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems (KES 2025)

  6. [6]

    Chetan Arora et al. 2024. Advancing Requirements Engineering Through Generative AI: Assessing the Role of LLMs. Springer Nature Switzerland, 129–148

  7. [7]

    Sarmad Bashir, Muhammad Abbas, Alessio Ferrari, Mehrdad Saadatmand, and Pernilla Lindberg. 2023. Requirements classification for smart allocation: A case study in the railway industry. In 2023 IEEE 31st International Requirements Engineering Conference (RE). IEEE, 201–211

  8. [8]

    Manal Binkhonain and Reem Alfayez. 2025. Are prompts all you need? Evaluating prompt-based Large Language Models (LLM)s for software requirements classification. Requir. Eng. 30, 4 (Sept. 2025), 423–443. doi:10.1007/s00766-025-00451-8

  9. [9]

    Pierre Bourque and Richard E. Fairley (Eds.). 2024. Guide to the Software Engineering Body of Knowledge (SWEBOK) (version 4.0 ed.). IEEE Computer Society. https://www.computer.org/education/bodies-of-knowledge/software-engineering/v4

  10. [10]

    Haowei Cheng, Jati H Husen, Yijun Lu, Teeradaj Racharak, Nobukazu Yoshioka, Naoyasu Ubayashi, and Hironori Washizaki. 2024. Generative AI for requirements engineering: A systematic literature review. arXiv preprint arXiv:2409.06741 (2024)

  11. [11]

    Flavio Chierichetti and Ravi Kumar. 2015. LSH-Preserving Functions and Their Applications. J. ACM 62, 5, Article 33 (Nov. 2015), 25 pages. doi:10.1145/2816813

  12. [12]

    Jane Cleland-Huang, Sepideh Mazrouee, Huang Liguo, and Dan Port. 2007. NFR Dataset. doi:10.5281/zenodo.268542

  13. [13]

    Jane Cleland-Huang, Raffaella Settimi, Xuchang Zou, and Peter Solc. 2007. Automated classification of non-functional requirements. Requirements Engineering 12, 2 (2007), 103–120

  14. [14]

    Fabiano Dalpiaz, Davide Dell’Anna, Fatma Basak Aydemir, and Sercan Çevikol. 2019. Requirements classification with interpretable machine learning and dependency parsing. In RE’19. IEEE, 142–152

  15. [15]

    V. De Martino, M. A. Zadenoori, X. Franch, and A. Ferrari. 2025. Green Prompt Engineering: Investigating the Energy Impact of Prompt Design in Software Engineering. arXiv preprint arXiv:2509.22320 (2025)

  16. [16]

    Neil A. Ernst and John Mylopoulos. 2010. On the Perception of Software Quality Requirements during the Project Lifecycle. In Requirements Engineering: Foundation for Software Quality, Roel Wieringa and Anne Persson (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 143–157

  17. [17]

    Alessio Ferrari, Sallam Abualhaija, and Chetan Arora. 2024. Model Generation with LLMs: From Requirements to UML Sequence Diagrams. In REW’24. IEEE, 291–300

  18. [18]

    Alessio Ferrari, Felice Dell’Orletta, Andrea Esuli, Vincenzo Gervasi, Stefania Gnesi, et al. 2017. Natural language requirements processing: a 4D vision. IEEE Software 34, 6 (2017), 28–35. J. ACM, Vol. 00, No. 0, Article 111. Publication date: January 2026. BT-APE: A Computationally Light Backtracking Approach to Automatic Prompt Engineering for Requiremen...

  19. [19]

    Martin Glinz. 2007. On Non-Functional Requirements. In 15th IEEE International Requirements Engineering Conference (RE 2007). 21–26. doi:10.1109/RE.2007.45

  20. [20]

    Charles Haley, Robin Laney, Jonathan Moffett, and Bashar Nuseibeh. 2008. Security Requirements Engineering: A Framework for Representation and Analysis. IEEE Trans. Softw. Eng. 34, 1 (Jan. 2008), 133–153. doi:10.1109/TSE.2007.70754

  21. [21]

    Tobias Hey, Jan Keim, Anne Koziolek, and Walter F Tichy. 2020. NoRBERT: Transfer learning for requirements classification. In RE’20. IEEE, 169–179

  22. [22]

    Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large language models for software engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology 33, 8 (2024), 1–79

  23. [23]

    Kaicheng Huang, Fanyu Wang, Yutan Huang, and Chetan Arora. 2025. Prompt Engineering for Requirements Engineering: A Literature Review and Roadmap. arXiv preprint arXiv:2507.07682 (2025)

  24. [24]

    Daan Kepel and Konstantina Valogianni. 2024. Autonomous Prompt Engineering in Large Language Models. arXiv:2407.11000 [cs.CL] https://arxiv.org/abs/2407.11000

  25. [25]

    Barbara Kitchenham, Lech Madeyski, and David Budgen. 2022. SEGRESS: Software Engineering Guidelines for Reporting Secondary Studies. IEEE Transactions on Software Engineering 49, 3 (2022)

  26. [26]

    Eric Knauss, Siv Houmb, Kurt Schneider, Shareeful Islam, and Jan Jürjens. 2011. Supporting requirements engineers in recognising security issues. In Requirements Engineering: Foundation for Software Quality: 17th International Working Conference, (REFSQ 2011) (28-30). Springer, Essen, Germany, 4–18

  27. [27]

    Eric Knauss, Siv Hilde Houmb, Shareeful Islam, Jan Jürjens, and Kurt Schneider. 2021. SecReq. doi:10.5281/zenodo.4530183

  28. [28]

    Armin Kobilica, Mohammed Ayub, and Jameleddine Hassine. 2020. Automated identification of security requirements: A machine learning approach. In Proceedings of the 24th International Conference on Evaluation and Assessment in Software Engineering. 475–480

  29. [29]

    Zijad Kurtanović and Walid Maalej. 2017. Automatically classifying functional and non-functional requirements using supervised machine learning. In 2017 IEEE 25th International Requirements Engineering Conference (RE). IEEE, Lisbon, Portugal, 490–495

  30. [30]

    Minchan Kwon, Gaeun Kim, Jongsuk Kim, Haeil Lee, and Junmo Kim. 2024. StablePrompt: Automatic Prompt Tuning using Reinforcement Learning for Large Language Models. arXiv preprint arXiv:2410.07652 (2024)

  31. [31]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems 33 (2020)

  32. [32]

    Manjeshwar Aniruddh Mallya, Alessio Ferrari, Mohammad Amin Zadenoori, and Jacek Dabrowski. 2026. From Online User Feedback to Requirements: Evaluating Large Language Models for Classification and Specification Tasks. In Requirements Engineering: Foundation for Software Quality - 32nd International Working Conference, REFSQ 2026, Poznań, Poland, March 23-2...

  33. [33]

    M. A. Mallya, A. Ferrari, M. A. Zadenoori, and J. Dąbrowski. 2026. RITA: A Tool for Automated Requirements Classification and Specification from Online User Feedback. arXiv preprint arXiv:2601.11362 (2026)

  34. [34]

    Alessio Miaschi, Felice Dell’Orletta, and Giulia Venturi. 2024. Evaluating Large Language Models via Linguistic Profiling. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 2835–2848. doi:10.1865...

  35. [35]

    Thomas Olsson, Séverine Sentilles, and Efi Papatheocharous. 2022. A systematic literature review of empirical research on quality requirements. Requirements Engineering 27, 2 (June 2022), 249–271. doi:10.1007/s00766-022-00373-9

  36. [36]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human fee...

  37. [37]

    Jordan Peer, Yaniv Mordecai, and Yoram Reich. 2024. NLP4ReF: Requirements classification and forecasting: From model-based design to large language models. In 2024 IEEE Aerospace Conference. IEEE, 1–16

  38. [38]

    Kai Petersen, Sairam Vakkalanka, and Ludwik Kuzniarz. 2015. Guidelines for Conducting Systematic Mapping Studies in Software Engineering: An Update. Information and Software Technology 64 (2015)

  39. [39]

    Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. 2023. Automatic Prompt Optimization with "Gradient Descent" and Beam Search. arXiv:2305.03495 [cs.CL] https://arxiv.org/abs/2305.03495

  40. [40]

    Yuman Qin and Rong Peng. 2025. ChatNRC: A Non-functional Requirement Classification Framework Based on a Generative and Discriminative Mechanism. In 2025 32nd Asia-Pacific Software Engineering Conference (APSEC). 467–478. doi:10.1109/APSEC66846.2025.00052

  41. [41]

    Gokul Rejithkumar and Preethu Rose Anish. 2025. NICE: Non-Functional Requirements Identification, Classification, and Explanation Using Small Language Models. In 2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). 284–295. doi:10.1109/ICSE-SEIP66354.2025.00031

  42. [42]

    Alberto D Rodriguez, Katherine R Dearstyne, and Jane Cleland-Huang. 2023. Prompts matter: Insights and strategies for prompt engineering in automated software traceability. In REW’23. IEEE, 455–464

  43. [43]

    Sarah Santos, Travis D. Breaux, Thomas B. Norton, Sara Haghighi, and Sepideh Ghanavati. 2024. Requirements Satisfiability with In-Context Learning. In RE’24. IEEE, 168–179

  44. [44]

    Md Shafikuzzaman, Md Rakibul Islam, Shuaib Zaman, Andrew Ma, and Anwarul Islam Sifat. 2025. On the Effectiveness of Zero-Shot and Few-Shot Pretrained Language Models for Software Requirement Classification. IEEE Access 13 (2025), 159439–159453. doi:10.1109/ACCESS.2025.3607813

  45. [45]

    Guttorm Sindre and Andreas Opdahl. 2003. A Reuse-Based Approach to Determining Security Requirements. (05 2003)

  46. [46]

    Hamed Taherkhani, Melika Sepidband, Hung Viet Pham, Song Wang, and Hadi Hemmati. 2026. Automated Prompt Engineering for Cost-Effective Code Generation Using Evolutionary Algorithms.ACM Trans. Softw. Eng. Methodol. (March 2026). doi:10.1145/3805704 Just Accepted

  47. [47]

    Maciej Tomczak and Ewa Tomczak-Łukaszewska. 2014. The need to report effect size estimates revisited. An overview of some recommended measures of effect size. 21 (01 2014), 19–25

  48. [48]

    Jonathan Ullrich, Matthias Koch, and Andreas Vogelsang. 2025. From Requirements to Code: Understanding Developer Practices in LLM-Assisted Software Engineering. In 2025 IEEE 33rd International Requirements Engineering Conference (RE). 257–266

  49. [49]

    Vasily Varenov and Aydar Gabdrahmanov. 2021. Security requirements classification into groups using nlp transformers. In 2021 IEEE 29th International Requirements Engineering Conference Workshops (REW). IEEE, Notre Dame, IN, USA, 444–450

  50. [50]

    Andreas Vogelsang and Jannik Fischbach. 2024. Using Large Language Models for Natural Language Processing Tasks in Requirements Engineering: A Systematic Guideline. arXiv:2402.13823 [cs.SE] https://arxiv.org/abs/2402.13823

  51. [51]

    Ruochen Wang, Sohyun An, Minhao Cheng, Tianyi Zhou, Sung Ju Hwang, and Cho-Jui Hsieh. 2024. One Prompt is not Enough: Automated Construction of a Mixture-of-Expert Prompts. arXiv:2407.00256 [cs.AI] https://arxiv.org/abs/2407.00256

  52. [52]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv preprint arXiv:2201.11903 (2022)

  53. [53]

    Qinyuan Ye, Maxamed Axmed, Reid Pryzant, and Fereshte Khani. 2024. Prompt Engineering a Prompt Engineer. arXiv:2311.05661 [cs.CL] https://arxiv.org/abs/2311.05661

  54. [54]

    Amin Zadenoori. 2026. aminzadenoori/Backtracking-enhanced-Automatic-Prompt-Engineering-APE-for-requirements-classification.: APE Classification Tool - Backtracking-enhanced Automatic Prompt Engineering for Requirements Classification. doi:10.5281/zenodo.20438927

  55. [55]

    Mohammad Amin Zadenoori, Liping Zhao, Waad Alhoshan, and Alessio Ferrari. 2025. Automatic Prompt Engineering: The Case of Requirements Classification. In Requirements Engineering: Foundation for Software Quality (REFSQ) (Lecture Notes in Computer Science, Vol. 15588). Springer Nature, United States, 217–225

  56. [56]

    M. A. Zadenoori, V. De Martino, J. Dabrowski, X. Franch, and A. Ferrari. 2025. Does Model Size Matter? A Comparison of Small and Large Language Models for Requirements Classification. arXiv preprint arXiv:2510.21443 (2025)

  57. [57]

    Mohammad Amin Zadenoori, Jacek Dąbrowski, Waad Alhoshan, Liping Zhao, and Alessio Ferrari. 2025. Large Language Models (LLMs) for Requirements Engineering (RE): A Systematic Literature Review. arXiv:2509.11446 [cs.SE] https://arxiv.org/abs/2509.11446

  58. [58]

    Liping Zhao, Waad Alhoshan, Alessio Ferrari, Keletso J Letsholo, Muideen A Ajagbe, Erol-Valeriu Chioasca, and Riza T Batista-Navarro. 2021. Natural language processing for requirements engineering: A systematic mapping study. ACM Computing Surveys (CSUR) 54, 3 (2021), 1–41

  59. [59]

    Xin Zhou, Martin Weyssow, Ratnadira Widyasari, Ting Zhang, Junda He, Yunbo Lyu, Jianming Chang, Beiqi Zhang, Dan Huang, and David Lo. 2025. Lessleak-bench: A first investigation of data leakage in llms across 83 software engineering benchmarks. arXiv preprint arXiv:2502.06215 (2025)

  60. [60]

    Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2023. Large Language Models Are Human-Level Prompt Engineers. arXiv:2211.01910 [cs.LG] https://arxiv.org/abs/2211.01910