arxiv: 2510.18787 · v2 · submitted 2025-10-21 · 💻 cs.SE

Characterizing Datasets for LLM-based Requirements Engineering: A Systematic Mapping Study

Pith reviewed 2026-05-18 04:27 UTC · model grok-4.3

classification 💻 cs.SE
keywords LLM-based Requirements Engineering · Datasets · Systematic Mapping Study · Dataset Characterization · Open Science · Elicitation Activities · Language Diversity · RE Tasks

The pith

A systematic mapping of 62 public datasets across 45 studies shows that LLM-based requirements engineering research relies on incomplete and imbalanced data resources.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper maps publicly available datasets used in LLM-based requirements engineering research and introduces a structured scheme to characterize them across dimensions such as artifact type, granularity, RE activity, task support, domain, and language. It identifies 45 primary studies referencing 62 datasets and highlights clear imbalances in the current landscape. These imbalances include limited coverage of elicitation activities, insufficient language and socio-technical diversity, and incomplete open-science practices like data availability and documentation. The resulting catalogue is intended to help researchers select, compare, and reuse datasets more effectively. This supports stronger empirical work by making dataset limitations visible rather than leaving them scattered across individual papers.

Core claim

By applying a consistent multi-dimensional characterization scheme to the 62 datasets referenced in 45 studies, the mapping reveals that current resources provide uneven support for the full range of RE activities, with particular under-representation in elicitation and limited variety in language and socio-technical contexts, alongside gaps in open-science practices.

What carries the argument

A multi-dimensional characterization scheme for datasets that records artifact type, granularity, RE activity, supported task, application domain, language, and open-science indicators.
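
One way to picture such a scheme is as a typed record with one field per dimension. The sketch below is illustrative only: the class name, the field names, the helper `undocumented_dimensions`, and the example values are assumptions made here, not the paper's actual coding labels, and the sample entry is hypothetical rather than drawn from the catalogue.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DatasetRecord:
    """One catalogue row. Field names are assumed, not taken from the paper."""
    name: str
    artifact_type: str      # e.g. "requirement", "user story"
    granularity: str        # e.g. "sentence", "document"
    re_activity: str        # e.g. "elicitation", "specification"
    supported_task: str     # e.g. "classification"
    domain: str
    language: str
    # Hypothetical open-science indicator flags.
    open_science: dict = field(default_factory=dict)


def undocumented_dimensions(record):
    """Return the names of dimensions left empty for this dataset."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Hypothetical entry with the application domain left unrecorded.
example = DatasetRecord(
    name="hypothetical-dataset",
    artifact_type="user story",
    granularity="sentence",
    re_activity="specification",
    supported_task="classification",
    domain="",
    language="English",
    open_science={"public_link": True, "license_stated": False},
)

print(undocumented_dimensions(example))
```

Recording every dataset against the same fields is what makes gaps visible: an empty field is a documented absence rather than an unstated one.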

If this is right

  • Researchers gain a single catalogue for choosing datasets matched to specific RE tasks such as elicitation or specification.
  • Future dataset creation can target the identified gaps in elicitation activities and language diversity.
  • Comparability across LLM evaluations in RE improves because datasets share documented properties.
  • Empirical studies in LLM-based RE can cite the characterization dimensions to justify dataset selection.
  • Open-science adoption may increase if authors follow the scheme when releasing new datasets.
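
As a rough illustration of the first two points, a catalogue of this kind could be queried for datasets matching an RE activity and summarized to expose thin coverage. The rows, field names, activity labels, and the `max_share` threshold below are hypothetical placeholders, not entries, counts, or criteria from the paper.

```python
from collections import Counter

# Hypothetical catalogue rows; none of these are real entries or counts.
catalogue = [
    {"name": "ds_a", "re_activity": "specification", "language": "English"},
    {"name": "ds_b", "re_activity": "specification", "language": "English"},
    {"name": "ds_c", "re_activity": "elicitation", "language": "English"},
    {"name": "ds_d", "re_activity": "validation", "language": "German"},
]


def select(rows, **criteria):
    """Keep the rows whose fields equal every given criterion."""
    return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]


def coverage(rows, dimension):
    """Count how many datasets carry each value of one dimension."""
    return Counter(r[dimension] for r in rows)


def underrepresented(rows, dimension, max_share):
    """List values whose share of the catalogue is at most max_share."""
    counts = coverage(rows, dimension)
    total = sum(counts.values())
    return sorted(v for v, n in counts.items() if n / total <= max_share)


print([r["name"] for r in select(catalogue, re_activity="elicitation")])
print(underrepresented(catalogue, "re_activity", max_share=0.25))
```

The threshold is an arbitrary choice for the sketch; any real analysis would need to justify where "under-represented" begins.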

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Dataset creators could prioritize non-English stakeholder dialogues to test whether current LLM performance holds outside dominant languages.
  • The characterization scheme could be applied to private or industry datasets to check whether public ones are the main source of imbalance.
  • Longer-term, filling the elicitation gap might require datasets that capture live stakeholder negotiations rather than static requirements documents.
  • The mapping approach itself offers a template for similar characterizations in other subfields that combine LLMs with natural-language engineering tasks.

Load-bearing premise

The 45 studies and 62 datasets located through the mapping process represent the full set of publicly available datasets used in LLM-based RE without major omissions from search limits or publication bias.

What would settle it

A follow-up search that locates more than a handful of additional public datasets used in LLM-based RE studies published in the same period but missed by the original mapping.

Figures

Figures reproduced from arXiv: 2510.18787 by Carlota Catot, Quim Motger, Xavier Franch.

Figure 1: Distribution of datasets across metadata attributes. Labels with fewer …
Figure 2: Usage growth in the number of datasets over time.

Original abstract

Large Language Models (LLMs) depend on high-quality, domain-specific natural language datasets. This dependency is particularly pronounced in Requirements Engineering (RE), where core activities rely on textual artifacts such as requirements, specifications, and stakeholder feedback. Despite the increasing use of LLMs in RE, data scarcity remains a widely reported limitation. While several datasets support LLM-based RE research, they are scattered across studies and lack systematic characterization, hindering reuse, comparability and assessment. This paper addresses this gap by examining which public datasets are used in LLM-based RE, how they can be consistently characterized, and which RE tasks and dataset properties remain under-represented. We report on a systematic mapping study of 45 primary studies referencing 62 publicly available datasets. Each dataset is characterized using a structured scheme covering multiple dimensions, including relevant descriptors such as artifact type, granularity, RE activity, supported task, application domain, and language, among others. The results reveal notable imbalances, including an incomplete adoption of open-science practices, limited dataset support for elicitation activities, and a lack of language and socio-technical diversity. The resulting catalogue and characterisation scheme support informed dataset selection, comparison, and reuse, contributing to stronger empirical foundations for LLM-based RE research and evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper conducts a systematic mapping study of 45 primary studies that reference 62 publicly available datasets used in LLM-based Requirements Engineering research. It defines and applies a multi-dimensional characterization scheme covering artifact type, granularity, RE activity, supported task, application domain, language, and related properties. The results identify imbalances including incomplete open-science practices, limited support for elicitation activities, and insufficient language and socio-technical diversity, while providing a catalogue intended to improve dataset selection, comparison, and reuse.

Significance. If the mapping and characterization are robust, the work supplies a practical catalogue and scheme that can reduce data-scarcity barriers and improve comparability in LLM-based RE. Explicit identification of gaps in elicitation support and diversity supplies actionable guidance for future dataset creation and could strengthen empirical evaluation practices in the subfield.

major comments (1)
  1. §3 (Research Method), search protocol and inclusion criteria: the strategy is restricted to indexed English-language venues and excludes preprints; this directly bears on the headline claim of 'lack of language and socio-technical diversity' reported in §5.3–5.4. Without a sensitivity analysis that re-runs the mapping after adding arXiv and non-English sources, it remains unclear whether the observed imbalances are field properties or sampling artifacts.
minor comments (2)
  1. Table 1 and Figure 2: the PRISMA-style flow diagram and dataset-count summary should explicitly state how many of the 62 datasets come from the same primary study to allow readers to assess duplication effects on the imbalance statistics.
  2. §4.1 (Characterization Scheme): the definition of 'socio-technical diversity' is introduced only by example; a short operational definition or coding rubric would improve reproducibility of the diversity assessment.
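
The duplication check requested in the first minor comment amounts to counting datasets per primary study. A minimal sketch, assuming the mapping is available as (study, dataset) pairs; the identifiers below are invented placeholders, not the paper's data, and both function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical (primary study, dataset) pairs; not the paper's data.
pairs = [
    ("S1", "D1"),
    ("S1", "D2"),
    ("S1", "D3"),
    ("S2", "D3"),
    ("S3", "D4"),
]


def datasets_per_study(pairs):
    """Number of distinct datasets referenced by each primary study."""
    by_study = defaultdict(set)
    for study, dataset in pairs:
        by_study[study].add(dataset)
    return {study: len(datasets) for study, datasets in by_study.items()}


def single_source_datasets(pairs):
    """Datasets referenced by exactly one primary study."""
    by_dataset = defaultdict(set)
    for study, dataset in pairs:
        by_dataset[dataset].add(study)
    return sorted(d for d, studies in by_dataset.items() if len(studies) == 1)


print(datasets_per_study(pairs))
print(single_source_datasets(pairs))
```

If a few studies contribute most of the datasets, the reported imbalances may partly reflect those studies' choices rather than the field as a whole.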

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our systematic mapping study. We address the major comment below and propose targeted revisions to improve the manuscript.

Point-by-point responses
  1. Referee: §3 (Research Method), search protocol and inclusion criteria: the strategy is restricted to indexed English-language venues and excludes preprints; this directly bears on the headline claim of 'lack of language and socio-technical diversity' reported in §5.3–5.4. Without a sensitivity analysis that re-runs the mapping after adding arXiv and non-English sources, it remains unclear whether the observed imbalances are field properties or sampling artifacts.

    Authors: Our search protocol follows established guidelines for systematic mapping studies (Petersen et al., 2008), which prioritize peer-reviewed publications from indexed databases to ensure rigor, quality control, and reproducibility. Excluding preprints is deliberate, as they may be unstable, updated, or withdrawn. We acknowledge that limiting the search to English-language indexed venues could under-represent non-English publications and thus affect the language diversity assessment; socio-technical diversity is characterized from the included studies' reported domains and contexts. To address this, we will expand the Threats to Validity section to explicitly discuss how the search strategy may influence the observed imbalances and treat it as a study limitation. We maintain that the findings reflect the current state of the peer-reviewed literature captured by our protocol. A full re-execution including arXiv and non-English sources would require substantial new effort disproportionate to a revision; we therefore propose the limitation discussion rather than a complete sensitivity analysis.

    Revision: partial

Circularity Check

0 steps flagged

No circularity: mapping study reports direct observations from literature sample

full rationale

The paper performs a systematic mapping study: it defines a search protocol, identifies 45 primary studies and 62 datasets, applies a characterization scheme, and reports observed imbalances (e.g., limited elicitation support, incomplete open-science practices). These results are empirical summaries of the selected corpus rather than any derivation, prediction, or fitted quantity that reduces to the inputs by construction. No equations, self-definitional claims, fitted-input-as-prediction steps, or load-bearing self-citations appear in the derivation chain. The central findings rest on external literature inspection and are therefore independent of internal redefinition or circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

As a mapping study the central claim rests on the completeness and validity of the literature search plus the authors' choice of characterization dimensions; no free parameters or invented entities are involved.

axioms (1)
  • Domain assumption: The search strategy and inclusion/exclusion criteria used to select the 45 primary studies are appropriate and comprehensive for the field.
    Standard assumption in systematic mapping studies; specific details are not provided in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
    Relation between the paper passage and the cited Recognition theorem: unclear
    Paper passage: "We report on a systematic mapping study of 45 primary studies referencing 62 publicly available datasets. Each dataset is characterized using a structured scheme covering multiple dimensions, including relevant descriptors such as artifact type, granularity, RE activity, supported task, application domain, and language"

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
    Relation between the paper passage and the cited Recognition theorem: unclear
    Paper passage: "The results reveal notable imbalances, including an incomplete adoption of open-science practices, limited dataset support for elicitation activities, and a lack of language and socio-technical diversity"

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

  1. [1] Cleland-Huang, J., Mazrouee, S., Huang, L., Port, D.: PROMISE nfr. Data set (2007). https://doi.org/10.5281/zenodo.268542

  2. [2] Fan, A., et al.: Large language models for software engineering: Survey and open problems. In: International Conference on Software Engineering. pp. 31–53 (2023). https://doi.org/10.1109/ICSE-FoSE59343.2023.00008

  3. [3] Ferrari, A., Spagnolo, G.O., Gnesi, S.: PURE: a Dataset of Public Requirements Documents (1.0). Data set (2018). https://doi.org/10.5281/zenodo.1414117

  4. [4] González, A., Franch, X., Lo, D., Martínez-Fernández, S.: How do pre-trained models support software engineering? an empirical study in hugging face (2025). https://arxiv.org/abs/2506.03013

  5. [5] Hou, X., et al.: Large language models for software engineering: A systematic literature review. ACM Trans. Softw. Eng. Methodol. 33(8) (Dec 2024). https://doi.org/10.1145/3695988

  6. [6] Khan, J., et al.: Large language model for requirements engineering: A systematic literature review (2024). https://doi.org/10.21203/rs.3.rs-5589929/v1

  7. [7] Marques, N., Silva, R.R., Bernardino, J.: Using chatgpt in software requirements engineering: A comprehensive review. Future Internet 16(6) (2024). https://doi.org/10.3390/fi16060180

  8. [8] Martín-Martín, A., et al.: Google scholar, microsoft academic, scopus, dimensions, web of science, and opencitations' coci: a multidisciplinary comparison of coverage via citations. Scientometrics 126(1), 871–906 (2021). https://doi.org/10.1007/s11192-020-03690-4

  9. [9] Necula, S.C., Dumitriu, F., Greavu-Șerban, V.: A systematic literature review on using natural language processing in software requirements engineering. Electronics 13(11) (2024). https://doi.org/10.3390/electronics13112055

  10. [10] Petersen, K., et al.: Systematic mapping studies in software engineering. In: 12th International Conference on Evaluation and Assessment in Software Engineering. pp. 68–77 (2008). https://dl.acm.org/doi/10.5555/2227115.2227123

  11. [11] Ronanki, K., Berger, C., Horkoff, J.: Investigating chatgpt's potential to assist in requirements elicitation processes. In: 2023 49th Euromicro Conference on SEAA. pp. 354–361 (2023). https://doi.org/10.1109/SEAA60479.2023.00061

  12. [12] Vogelsang, A., Fischbach, J.: Using Large Language Models for Natural Language Processing Tasks in Requirements Engineering: A Systematic Guideline, pp. 435–456 (2025). https://doi.org/10.1007/978-3-031-73143-3_16

  13. [13] Zadenoori, M.A., Dąbrowski, J., Alhoshan, W., Zhao, L., Ferrari, A.: Large language models (llms) for requirements engineering (re): A systematic literature review (2025). https://arxiv.org/abs/2509.11446

  14. [14] Zhao, L., et al.: Natural language processing for requirements engineering: A systematic mapping study. ACM CSUR (2021). https://doi.org/10.1145/3444689