Characterizing Datasets for LLM-based Requirements Engineering: A Systematic Mapping Study
Pith reviewed 2026-05-18 04:27 UTC · model grok-4.3
The pith
A systematic mapping of 62 public datasets across 45 studies shows that LLM-based requirements engineering research relies on incomplete and imbalanced data resources.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By applying a consistent multi-dimensional characterization scheme to the 62 datasets referenced in 45 studies, the mapping reveals that current resources provide uneven support for the full range of RE activities, with particular under-representation in elicitation and limited variety in language and socio-technical contexts, alongside gaps in open-science practices.
What carries the argument
A multi-dimensional characterization scheme for datasets that records artifact type, granularity, RE activity, supported task, application domain, language, and open-science indicators.
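As a minimal sketch of what one record under such a scheme might look like, assuming one field per dimension named above: the class name, field names, open-science keys, and all example rows are hypothetical and are not taken from the paper's catalogue.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """One catalogue entry; every field name here is an illustrative assumption."""
    name: str
    artifact_type: str
    granularity: str
    re_activity: str
    supported_task: str
    application_domain: str
    language: str
    # Open-science indicators as named boolean flags (hypothetical keys).
    open_science: dict = field(default_factory=dict)


def count_by(records, dimension):
    """Tally records along one characterization dimension."""
    return Counter(getattr(record, dimension) for record in records)


# Placeholder rows invented for illustration; not entries from the paper's catalogue.
records = [
    DatasetRecord("example-a", "requirements document", "document",
                  "specification", "classification", "healthcare", "English",
                  {"has_license": True, "has_persistent_id": True}),
    DatasetRecord("example-b", "user review", "sentence",
                  "elicitation", "extraction", "mobile apps", "English",
                  {"has_license": False, "has_persistent_id": True}),
    DatasetRecord("example-c", "user story", "statement",
                  "specification", "generation", "e-commerce", "German"),
]

print(count_by(records, "re_activity"))
print(count_by(records, "language"))
```

Tallying along a single dimension in this way is the kind of summary from which imbalances, such as few elicitation datasets, would become visible.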
If this is right
- Researchers gain a single catalogue for choosing datasets matched to specific RE tasks such as elicitation or specification.
- Future dataset creation can target the identified gaps in elicitation activities and language diversity.
- Comparability across LLM evaluations in RE improves because datasets share documented properties.
- Empirical studies in LLM-based RE can cite the characterization dimensions to justify dataset selection.
- Open-science adoption may increase if authors follow the scheme when releasing new datasets.
Where Pith is reading between the lines
- Dataset creators could prioritize non-English stakeholder dialogues to test whether current LLM performance holds outside dominant languages.
- The characterization scheme could be applied to private or industry datasets to check whether public ones are the main source of imbalance.
- Longer-term, filling the elicitation gap might require datasets that capture live stakeholder negotiations rather than static requirements documents.
- The mapping approach itself offers a template for similar characterizations in other subfields that combine LLMs with natural-language engineering tasks.
Load-bearing premise
The 45 studies and 62 datasets located through the mapping process represent the full set of publicly available datasets used in LLM-based RE without major omissions from search limits or publication bias.
What would settle it
A follow-up search that locates more than a handful of additional public datasets used in LLM-based RE studies published in the same period but missed by the original mapping.
Original abstract
Large Language Models (LLMs) depend on high-quality, domain-specific natural language datasets. This dependency is particularly pronounced in Requirements Engineering (RE), where core activities rely on textual artifacts such as requirements, specifications, and stakeholder feedback. Despite the increasing use of LLMs in RE, data scarcity remains a widely reported limitation. While several datasets support LLM-based RE research, they are scattered across studies and lack systematic characterization, hindering reuse, comparability and assessment. This paper addresses this gap by examining which public datasets are used in LLM-based RE, how they can be consistently characterized, and which RE tasks and dataset properties remain under-represented. We report on a systematic mapping study of 45 primary studies referencing 62 publicly available datasets. Each dataset is characterized using a structured scheme covering multiple dimensions, including relevant descriptors such as artifact type, granularity, RE activity, supported task, application domain, and language, among others. The results reveal notable imbalances, including an incomplete adoption of open-science practices, limited dataset support for elicitation activities, and a lack of language and socio-technical diversity. The resulting catalogue and characterisation scheme support informed dataset selection, comparison, and reuse, contributing to stronger empirical foundations for LLM-based RE research and evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a systematic mapping study of 45 primary studies that reference 62 publicly available datasets used in LLM-based Requirements Engineering research. It defines and applies a multi-dimensional characterization scheme covering artifact type, granularity, RE activity, supported task, application domain, language, and related properties. The results identify imbalances including incomplete open-science practices, limited support for elicitation activities, and insufficient language and socio-technical diversity, while providing a catalogue intended to improve dataset selection, comparison, and reuse.
Significance. If the mapping and characterization are robust, the work supplies a practical catalogue and scheme that can reduce data-scarcity barriers and improve comparability in LLM-based RE. Explicit identification of gaps in elicitation support and diversity supplies actionable guidance for future dataset creation and could strengthen empirical evaluation practices in the subfield.
major comments (1)
- [§3] §3 (Research Method), search protocol and inclusion criteria: the strategy is restricted to indexed English-language venues and excludes preprints; this directly bears on the headline claim of 'lack of language and socio-technical diversity' reported in §5.3–5.4. Without a sensitivity analysis that re-runs the mapping after adding arXiv and non-English sources, it remains unclear whether the observed imbalances are field properties or sampling artifacts.
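The sensitivity analysis the referee asks for could be sketched, under strong simplifying assumptions, as a comparison of category shares before and after adding records from the excluded sources. The function names and all language labels below are hypothetical placeholders, not data from the paper.

```python
from collections import Counter


def category_shares(labels):
    """Return each label's share of the total (fractions summing to 1)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def share_shift(baseline, extra):
    """Change in each label's share after adding the extra records."""
    before = category_shares(baseline)
    after = category_shares(list(baseline) + list(extra))
    labels = set(before) | set(after)
    return {label: after.get(label, 0.0) - before.get(label, 0.0)
            for label in labels}


# Placeholder language labels, invented for illustration only.
baseline = ["English"] * 9 + ["German"]          # original sample
extra = ["Chinese", "Portuguese", "English"]     # e.g. preprint or non-English sources
shift = share_shift(baseline, extra)

for label, delta in sorted(shift.items()):
    print(f"{label}: {delta:+.3f}")
```

If the shifts stay small after the extra sources are added, the imbalance is more plausibly a property of the field; large shifts would point to a sampling artifact. The sketch assumes a non-empty baseline.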
minor comments (2)
- [Table 1] Table 1 and Figure 2: the PRISMA-style flow diagram and dataset-count summary should explicitly state how many of the 62 datasets come from the same primary study to allow readers to assess duplication effects on the imbalance statistics.
- [§4.1] §4.1 (Characterization Scheme): the definition of 'socio-technical diversity' is introduced only by example; a short operational definition or coding rubric would improve reproducibility of the diversity assessment.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our systematic mapping study. We address the major comment below and propose targeted revisions to improve the manuscript.
Referee: [§3] §3 (Research Method), search protocol and inclusion criteria: the strategy is restricted to indexed English-language venues and excludes preprints; this directly bears on the headline claim of 'lack of language and socio-technical diversity' reported in §5.3–5.4. Without a sensitivity analysis that re-runs the mapping after adding arXiv and non-English sources, it remains unclear whether the observed imbalances are field properties or sampling artifacts.
Authors: Our search protocol follows established guidelines for systematic mapping studies (Petersen et al., 2008), which prioritize peer-reviewed publications from indexed databases to ensure rigor, quality control, and reproducibility. Excluding preprints is deliberate, as they may be unstable, updated, or withdrawn. We acknowledge that limiting the search to English-language indexed venues could under-represent non-English publications and thus affect the language diversity assessment; socio-technical diversity is characterized from the included studies' reported domains and contexts. To address this, we will expand the Threats to Validity section to explicitly discuss how the search strategy may influence the observed imbalances and treat it as a study limitation. We maintain that the findings reflect the current state of the peer-reviewed literature captured by our protocol. A full re-execution including arXiv and non-English sources would require substantial new effort disproportionate to a revision; we therefore propose the limitation discussion rather than a complete sensitivity analysis.
Revision: partial
Circularity Check
No circularity: mapping study reports direct observations from literature sample
The paper performs a systematic mapping study: it defines a search protocol, identifies 45 primary studies and 62 datasets, applies a characterization scheme, and reports observed imbalances (e.g., limited elicitation support, incomplete open-science practices). These results are empirical summaries of the selected corpus rather than any derivation, prediction, or fitted quantity that reduces to the inputs by construction. No equations, self-definitional claims, fitted-input-as-prediction steps, or load-bearing self-citations appear in the derivation chain. The central findings rest on external literature inspection and are therefore independent of internal redefinition or circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The search strategy and inclusion/exclusion criteria used to select the 45 primary studies are appropriate and comprehensive for the field.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "We report on a systematic mapping study of 45 primary studies referencing 62 publicly available datasets. Each dataset is characterized using a structured scheme covering multiple dimensions, including relevant descriptors such as artifact type, granularity, RE activity, supported task, application domain, and language"
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "The results reveal notable imbalances, including an incomplete adoption of open-science practices, limited dataset support for elicitation activities, and a lack of language and socio-technical diversity"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Cleland-Huang, J., Mazrouee, S., Huang, L., Port, D.: PROMISE nfr. Data set (2007). https://doi.org/10.5281/zenodo.268542
- [2] Fan, A., et al.: Large language models for software engineering: Survey and open problems. In: International Conference on Software Engineering. pp. 31–53 (2023). https://doi.org/10.1109/ICSE-FoSE59343.2023.00008
- [3] Ferrari, A., Spagnolo, G.O., Gnesi, S.: PURE: a Dataset of Public Requirements Documents (1.0). Data set (2018). https://doi.org/10.5281/zenodo.1414117
- [4]
- [5] Hou, X., et al.: Large language models for software engineering: A systematic literature review. ACM Trans. Softw. Eng. Methodol. 33(8) (Dec 2024). https://doi.org/10.1145/3695988
- [6] Khan, J., et al.: Large language model for requirements engineering: A systematic literature review (2024). https://doi.org/10.21203/rs.3.rs-5589929/v1
- [7] Marques, N., Silva, R.R., Bernardino, J.: Using ChatGPT in software requirements engineering: A comprehensive review. Future Internet 16(6) (2024). https://doi.org/10.3390/fi16060180
- [8] Martín-Martín, A., et al.: Google Scholar, Microsoft Academic, Scopus, Dimensions, Web of Science, and OpenCitations' COCI: a multidisciplinary comparison of coverage via citations. Scientometrics 126(1), 871–906 (2021). https://doi.org/10.1007/s11192-020-03690-4
- [9] Necula, S.C., Dumitriu, F., Greavu-Șerban, V.: A systematic literature review on using natural language processing in software requirements engineering. Electronics 13(11) (2024). https://doi.org/10.3390/electronics13112055
- [10] Petersen, K., et al.: Systematic mapping studies in software engineering. In: 12th International Conference on Evaluation and Assessment in Software Engineering. pp. 68–77 (2008). https://dl.acm.org/doi/10.5555/2227115.2227123
- [11] Ronanki, K., Berger, C., Horkoff, J.: Investigating ChatGPT's potential to assist in requirements elicitation processes. In: 2023 49th Euromicro Conference on SEAA. pp. 354–361 (2023). https://doi.org/10.1109/SEAA60479.2023.00061
- [12] Vogelsang, A., Fischbach, J.: Using Large Language Models for Natural Language Processing Tasks in Requirements Engineering: A Systematic Guideline, pp. 435–456 (2025). https://doi.org/10.1007/978-3-031-73143-3_16
- [13]
- [14] Zhao, L., et al.: Natural language processing for requirements engineering: A systematic mapping study. ACM CSUR (2021). https://doi.org/10.1145/3444689