Early-Stage Product Line Validation Using LLMs: A Study on Semi-Formal Blueprint Analysis

Alexander Felfernig; Damian Garber; Sebastian Lubos; Thi Ngoc Trang Tran; Viet-Man Le

arxiv: 2604.20523 · v1 · submitted 2026-04-22 · 💻 cs.SE · cs.AI


Pith reviewed 2026-05-09 23:56 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords large language models · feature models · software product lines · analysis operations · semi-formal blueprints · variability validation · early validation · model analysis


The pith

Large language models achieve 88-89% accuracy performing analysis operations on semi-formal feature model blueprints, approaching solver performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs can execute standard feature model analysis operations directly on concise textual blueprints that describe feature hierarchies and constraints in constrained language. It evaluates twelve current models against the FLAMA solver oracle across sixteen operations and multiple blueprints. Reasoning-optimized models reach 88-89% average accuracy, indicating they could serve as quick checks during the earliest phases of software product line scoping before formal models are built. The work also documents recurring errors in how models parse structure and reason about constraints, plus trade-offs between accuracy and inference cost. These results frame LLMs as accessible, low-overhead aids rather than replacements for dedicated solvers.

Core claim

The paper establishes that reasoning-optimized large language models achieve an average accuracy of 88-89% when performing sixteen standard analysis operations on semi-formal textual blueprints of feature models. These blueprints provide concise descriptions of feature hierarchies and constraints. By comparing LLM outputs to those from the solver-based tool FLAMA, the study shows that top models approach solver correctness. The findings also catalog common error types and accuracy versus cost considerations for model choice. This supports using LLMs as lightweight aids for validating variability early in software product line scoping.

What carries the argument

LLM execution of analysis operations on semi-formal blueprints, where models read constrained-language text describing feature hierarchies and constraints and produce outputs for operations such as consistency checking or dead-feature detection, benchmarked directly against a solver oracle.
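To make the operations concrete, here is a minimal sketch of what consistency checking, dead-feature detection, and core-feature detection compute. It is an illustration only: the feature names and constraints are hypothetical, and it enumerates configurations by brute force rather than using a SAT solver as the oracle does.

```python
from itertools import product

# Hypothetical toy blueprint; names and constraints are illustrative only.
FEATURES = ["Shop", "Payment", "Card", "Invoice", "Express"]


def is_valid(cfg):
    """Check one configuration (feature -> bool) against the toy model."""
    rules = [
        cfg["Shop"],                                    # root is always selected
        cfg["Payment"] == cfg["Shop"],                  # Payment is mandatory under Shop
        (cfg["Card"] + cfg["Invoice"])
        == (1 if cfg["Payment"] else 0),                # alternative group under Payment
        not cfg["Express"] or cfg["Shop"],              # Express is optional under Shop
        not cfg["Express"] or cfg["Card"],              # Express requires Card
        not cfg["Express"] or cfg["Invoice"],           # Express requires Invoice
    ]
    return all(rules)


def valid_configurations():
    """Enumerate every assignment and keep those that satisfy all rules."""
    configs = []
    for values in product([False, True], repeat=len(FEATURES)):
        cfg = dict(zip(FEATURES, values))
        if is_valid(cfg):
            configs.append(cfg)
    return configs


def dead_features(configs):
    """Features that appear in no valid configuration."""
    return [f for f in FEATURES if not any(c[f] for c in configs)]


def core_features(configs):
    """Features that appear in every valid configuration."""
    return [f for f in FEATURES if configs and all(c[f] for c in configs)]


configs = valid_configurations()
print("consistent:", bool(configs))
print("dead:", dead_features(configs))
print("core:", core_features(configs))
```

Express is dead here because it requires both members of an alternative group. Brute force only scales to toy models, which is why a solver serves as the reference for real inputs.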

If this is right

Early variability checks become feasible without first building a complete formal feature model.
Teams can choose specific models by balancing the observed accuracy levels against their inference costs.
Common error patterns in parsing and constraint reasoning point to targeted prompt or fine-tuning improvements.
The approach lowers the expertise threshold for initial product line validation steps.
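One way to read the accuracy versus cost point is as a Pareto filter: keep only models that no other model beats on both axes. The sketch below assumes that framing. The model names, accuracies, and costs are placeholders, not figures from the paper.

```python
def pareto_front(models):
    """Return names of models not dominated on (higher accuracy, lower cost).

    `models` is a list of (name, accuracy, cost) tuples.
    """
    front = []
    for name, acc, cost in models:
        dominated = any(
            (a >= acc and c <= cost) and (a > acc or c < cost)
            for _, a, c in models
        )
        if not dominated:
            front.append(name)
    return front


# Placeholder values for illustration only; cost is in arbitrary units.
candidates = [
    ("model_a", 0.89, 4.0),
    ("model_b", 0.88, 1.0),
    ("model_c", 0.80, 2.0),
    ("model_d", 0.70, 0.2),
]

print(pareto_front(candidates))
```

In this made-up example, model_c drops out because model_b is both more accurate and cheaper. The remaining models each represent a different trade-off.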

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same blueprint style could be applied to other semi-formal artifacts such as early requirements documents.
Hybrid pipelines that route quick LLM scans to a solver only when needed may become practical.
Accuracy on larger or more intricate blueprints will likely rise as reasoning capabilities in models continue to advance.
Real industry blueprints would provide a stronger test of whether the 88-89% figure holds outside the selected cases.

Load-bearing premise

That success on the chosen blueprints and sixteen operations will carry over to real early-stage product line scoping work.

What would settle it

A follow-up evaluation on a fresh set of industry blueprints in which the strongest LLMs produce incorrect results on more than 15 percent of operations relative to the solver.
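The criterion above reduces to a simple comparison of LLM answers against solver answers. The following sketch assumes answers are stored as comparable values, one per operation instance. The function names and the sample data are hypothetical.

```python
def error_rate(llm_answers, oracle_answers):
    """Fraction of operation instances where the LLM disagrees with the solver."""
    if not oracle_answers or len(llm_answers) != len(oracle_answers):
        raise ValueError("answer lists must be non-empty and of equal length")
    wrong = sum(1 for got, want in zip(llm_answers, oracle_answers) if got != want)
    return wrong / len(oracle_answers)


def exceeds_threshold(llm_answers, oracle_answers, threshold=0.15):
    """True if the disagreement rate is above the 15 percent mark."""
    return error_rate(llm_answers, oracle_answers) > threshold


# Hypothetical data: 20 operation instances, 4 of them answered incorrectly.
oracle = [True] * 20
llm = [True] * 16 + [False] * 4
print(error_rate(llm, oracle), exceeds_threshold(llm, oracle))
```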

Figures

Figures reproduced from arXiv: 2604.20523 by Alexander Felfernig, Damian Garber, Sebastian Lubos, Thi Ngoc Trang Tran, Viet-Man Le.

**Figure 1.** Prompting pipeline for dead feature detection, showing how the system and user prompts guide reasoning and how the LLM outputs XML results for solver comparison.

4.6 Baseline Oracle and Implementation. We use FLAMA 2.0.1 [12] as the solver-based oracle, executing AOs on UVL inputs through the Glucose3 SAT solver and the DD library to obtain exact and reproducible ground-truth results for all comparisons. Th…
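The comparison step can be pictured as parsing the model's XML reply and checking it against the solver's answer. The sketch below assumes a reply shaped like `<result><feature>…</feature></result>`. That schema and the function names are hypothetical, since the excerpt does not show the paper's actual XML format.

```python
import xml.etree.ElementTree as ET


def parse_dead_features(xml_text):
    """Collect feature names from a hypothetical <result><feature>..</feature></result> reply."""
    root = ET.fromstring(xml_text)
    return {
        el.text.strip()
        for el in root.iter("feature")
        if el.text and el.text.strip()
    }


def matches_oracle(xml_text, oracle_features):
    """Exact-match comparison; malformed XML counts as a miss in this sketch."""
    try:
        predicted = parse_dead_features(xml_text)
    except ET.ParseError:
        return False
    return predicted == set(oracle_features)


reply = "<result><feature>Express</feature></result>"
print(matches_oracle(reply, ["Express"]))
```

Treating unparseable output as incorrect is one possible design choice. An evaluation could also retry the request or score such cases separately.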

**Figure 2.** Accuracy (%) of general-purpose LLMs across AOs.

**Figure 3.** Accuracy (%) of reasoning-opt. LLMs across AOs.

**Figure 4.** Accuracy (%) of general-purpose LLMs on 16 AOs.

Original abstract

We study whether Large Language Models (LLMs) can perform feature model analysis operations (AOs) directly on semi-formal textual blueprints, i.e., concise constrained-language descriptions of feature hierarchies and constraints, enabling early validation in Software Product Line scoping. Using 12 state-of-the-art LLMs and 16 standard AOs, we compare their outputs against the solver-based oracle FLAMA. Results show that reasoning-optimized models (e.g., Grok 4 Fast Reasoning, Gemini 2.5 Pro) achieve 88-89% average accuracy across all evaluated blueprints and operations, approaching solver correctness. We identify systematic errors in structural parsing and constraint reasoning, and highlight accuracy-cost trade-offs that inform model selection. These findings position LLMs as lightweight assistants for early variability validation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LLMs hit 88-89% on feature model analysis from semi-formal blueprints but the 12 cases are too narrow to back the early-validation claims.


The paper tests 12 models on 16 standard operations using 12 textual blueprints and checks results against the FLAMA solver. Reasoning models come out ahead, and the work flags recurring problems in parsing and constraint handling while noting cost differences. That comparison is the concrete new piece here.

It is a straightforward empirical check in the product-line space and uses an external oracle, which keeps the numbers honest. The error breakdown and accuracy-cost notes are the parts that could actually help someone decide whether to try an LLM for quick scoping checks.

The main weakness is the test set. Twelve blueprints and sixteen operations are not shown to cover the range of feature counts, constraint density, or ambiguity that appears in real early-stage product lines. The paper treats exact solver match as the success metric, yet early validation often tolerates partial or approximate answers when the blueprint is still rough. Without more on how the blueprints were picked or tests on larger or noisier examples, the 88-89% figure stays tied to this specific collection.

This is for software product line researchers who are already looking at LLM support for variability work. It supplies a usable baseline on current model performance but will not shift standard practice on its own.

I would send it for peer review. The empirical setup is clear enough that referees can assess the blueprint selection and ask for the expansions needed to make the generalization argument stronger.

Referee Report

2 major / 1 minor

Summary. The manuscript investigates whether LLMs can execute 16 standard feature model analysis operations directly on semi-formal textual blueprints (concise constrained-language descriptions of feature hierarchies and constraints) to support early validation during software product line scoping. It evaluates 12 LLMs against the independent FLAMA solver oracle on 12 blueprints, reports that reasoning-optimized models (e.g., Grok 4 Fast Reasoning, Gemini 2.5 Pro) reach 88-89% average accuracy, identifies systematic errors in structural parsing and constraint reasoning, and discusses accuracy-cost trade-offs for model selection.

Significance. If the accuracy generalizes, the work could enable lightweight, solver-free initial checks on variability descriptions, lowering the barrier to early SPL scoping. Credit is due for the use of an external independent oracle (FLAMA) yielding concrete accuracy figures and for explicitly cataloging error categories rather than treating LLMs as black boxes.

major comments (2)

[Evaluation] Evaluation section: the central claim that reasoning-optimized LLMs approach solver correctness (88-89% accuracy) and can serve as practical early-validation assistants rests on the assumption that the 12 chosen blueprints and 16 operations are representative of real early-stage product-line scoping; the manuscript provides no data on feature counts, constraint densities, or ambiguity levels spanned by the blueprints, leaving generalizability unproven.
[Results] Results and abstract: the reported accuracy figures are presented without accompanying statistical tests, full prompt-engineering details, or a complete per-operation/per-model error breakdown, which weakens the evidential support for the headline performance numbers and the identified systematic errors.

minor comments (1)

[Abstract] Abstract: the opening sentence would be clearer if it stated the exact number of blueprints (12) alongside the 12 LLMs and 16 operations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important aspects of evaluation design and result presentation that we will address to improve the manuscript's rigor and transparency.


Referee: [Evaluation] Evaluation section: the central claim that reasoning-optimized LLMs approach solver correctness (88-89% accuracy) and can serve as practical early-validation assistants rests on the assumption that the 12 chosen blueprints and 16 operations are representative of real early-stage product-line scoping; the manuscript provides no data on feature counts, constraint densities, or ambiguity levels spanned by the blueprints, leaving generalizability unproven.

Authors: We agree that the manuscript lacks explicit quantitative characterization of the blueprints. In the revision we will add a dedicated table in the Evaluation section reporting, for each of the 12 blueprints, the number of features, number of cross-tree constraints, constraint density, and a brief qualitative indicator of ambiguity arising from the constrained-language formulation. While the blueprints were chosen to reflect typical early-stage scoping artifacts (hierarchies with moderate constraints), these metrics will allow readers to assess representativeness directly and will support the generalizability discussion without changing the experimental design or results.

revision: yes

Referee: [Results] Results and abstract: the reported accuracy figures are presented without accompanying statistical tests, full prompt-engineering details, or a complete per-operation/per-model error breakdown, which weakens the evidential support for the headline performance numbers and the identified systematic errors.

Authors: We acknowledge that the current presentation of results is not fully supported by statistical analysis or exhaustive breakdowns. We will revise the Results section and abstract to include (1) statistical tests (paired Wilcoxon signed-rank tests and 95% confidence intervals) comparing LLM accuracies to the FLAMA oracle, (2) expanded prompt-engineering details (exact templates, few-shot examples, and temperature settings) placed in an appendix, and (3) a complete per-operation and per-model error breakdown table that quantifies the frequency of structural-parsing versus constraint-reasoning errors. These additions will strengthen the evidential basis for the 88-89% accuracy claim and the error taxonomy.

revision: yes
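As an illustration of what a 95% confidence interval on an accuracy figure could look like, the sketch below computes a Wilson score interval for a proportion. This is one common choice, not necessarily the method the authors would use. The counts in the usage line are hypothetical, and the total of 192 assumes every one of the 16 operations is run on each of the 12 blueprints.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Hypothetical count of correct answers out of 12 blueprints x 16 operations.
low, high = wilson_interval(170, 192)
print(f"accuracy {170 / 192:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

An interval like this makes clear how much a headline accuracy could move with a sample of this size, which is the concern behind the referee's request.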

Circularity Check

0 steps flagged

No circularity detected in empirical evaluation


The paper is a straightforward empirical study: it selects 12 blueprints and 16 standard analysis operations, runs 12 LLMs on them, and measures accuracy by direct comparison to outputs from the independent external solver FLAMA. No equations, derivations, fitted parameters, or predictions are presented whose results reduce to the inputs by construction. Central claims (e.g., 88-89% accuracy for reasoning-optimized models) are computed from these oracle matches without self-definitional loops, self-citation load-bearing premises, or ansatzes smuggled via prior work. The evaluation is self-contained against the external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the representativeness of the 16 operations and the assumption that textual blueprints serve as valid early proxies for formal feature models; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: The 16 analysis operations are standard and representative for feature model validation.
Invoked as the benchmark set without derivation or justification in the abstract.


Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages

[1] 2025. All models overview - Anthropic — docs.anthropic.com. Retrieved September 29, 2025 from https://docs.claude.com/en/docs/about-claude/models/overview#model-comparison-table
[2] S. Apel, D. Batory, C. Kästner, and G. Saake. 2013. Feature-Oriented Software Product Lines: Concepts and Implementation. Springer.
[3] M. Becker, R. Rabiser, and G. Botterweck. 2024. Not quite there yet: remaining challenges in systems and software product line engineering as perceived by industry practitioners. In Proceedings of the 28th ACM International Systems and Software Product Line Conference (SPLC ’24). ACM, Dommeldange, Luxembourg, 179–190. doi:10.1145/3646548.3672587
[4] D. Benavides, A. Felfernig, J. Galindo, and F. Reinfrank. 2013. Automated Analysis in Feature Modelling and Product Configuration. In ICSR’13 (LNCS) number 7925. Springer, Pisa, Italy, 160–175.
[5] D. Benavides, S. Segura, and A. Ruiz-Cortes. 2010. Automated analysis of feature models 20 years later: A literature review. Inf. Sys., 35, 615–636, 6.
[6] D. Benavides, C. Sundermann, K. Feichtinger, J.A. Galindo, R. Rabiser, and T. Thüm. 2025. UVL: feature modelling with the universal variability language. Journal of Systems and Software, 225, 112326. doi:10.1016/j.jss.2024.112326
[7] T. Berger, J.-P. Steghöfer, T. Ziadi, J. Robin, and J. Martinez. 2020. The state of adoption and the challenges of systematic variability management in industry. Empirical Software Engineering, 25, 3, (May 2020), 1755–1797. doi:10.1007/s10664-019-09787-6
[8] T. Brown et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems. H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, (Eds.) Vol. 33. Curran Associates, Inc., 1877–1901.
[9] P.C. Clements and L. Northrop. 2002. Software product lines. Addison-Wesley.
[10] 2025. Models & Pricing | DeepSeek API Docs — api-docs.deepseek.com. Retrieved September 29, 2025 from https://api-docs.deepseek.com/quick_start/pricing/
[11] J. A. Galindo, A. J. Dominguez, J. White, and D. Benavides. 2023. Large language models to generate meaningful feature model instances. In Proceedings of the 27th ACM International Systems and Software Product Line Conference - Volume A (SPLC ’23). ACM, Tokyo, Japan, 15–26. doi:10.1145/3579027.3608973
[12] J. A. Galindo, J.-M. Horcas, A. Felfernig, D. Fernandez-Amoros, and D. Benavides. 2023. Flama: a collaborative effort to build a new framework for the automated analysis of feature models. In Proceedings of the 27th ACM International Systems and Software Product Line Conference - Volume B (SPLC ’23). ACM, Tokyo, Japan, 16–19. doi:10.1145/3579028.3609008
[13] J.A. Galindo, D. Benavides, P. Trinidad, A. Gutiérrez-Fernández, and A. Ruiz-Cortés. 2019. Automated Analysis of Feature Models: Quo Vadis? In 23rd International Systems and Software Product Line Conference - Volume A (SPLC ’19). ACM, Paris, France, 302. doi:10.1145/3336294.3342373
[14] S. Ghosh, D. Elenius, W. Li, P. Lincoln, N. Shankar, and W. Steiner. 2016. Arsenal: automatic requirements specification extraction from natural language. In NASA Formal Methods. S. Rayadurgam and O. Tkachuk, (Eds.) Springer International Publishing, Cham, 41–46.
[15] 2025. Gemini models | Gemini API | Google AI for Developers — ai.google.dev. Retrieved September 29, 2025 from https://ai.google.dev/gemini-api/docs/models
[16] L. Hotz, C. Bähnisch, S. Lubos, A. Felfernig, A., and J. Twiefel. 2024. Exploiting large language models for the automated generation of constraint satisfaction problems. 26th International Workshop on Configuration, Conf WS 2024. CEUR Workshop Proceedings, 3812, 91–100.
[17] J. Huang and K. C.-C. Chang. 2023. Towards reasoning in large language models: a survey. In Findings of the Association for Computational Linguistics: ACL 2023. A. Rogers, J. Boyd-Graber, and N. Okazaki, (Eds.) Association for Computational Linguistics, Toronto, Canada, (July 2023), 1049–1065. doi:10.18653/v1/2023.findings-acl.67
[18] A. Ishay, Z. Yang, and J. Lee. 2023. Leveraging Large Language Models to Generate Answer Set Programs. In Proceedings of the 20th International Conference on Principles of Knowledge Representation and Reasoning. (Aug. 2023), 374–383. doi:10.24963/kr.2023/37
[19] K. Kang, S. Cohen, J. Hess, W. Novak, and S. Peterson. 1990. Feature-oriented Domain Analysis (FODA) – Feasibility Study. Tech. Rep. – SEI-90-TR-21.
[20] C. Khor and R. R. Lutz. 2024. Enhancing the requirements engineering of configurable systems by the ongoing use of variability models. Requirements Engineering, 29, 3, (Sept. 2024), 303–328. doi:10.1007/s00766-024-00421-6
[21] T. Kojima, S. (S.) Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. 2022. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems. S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, (Eds.) Vol. 35. Curran Associates, Inc., 22199–22213.
[22] LangChain. 2025. GitHub - langchain-ai/langchain: Build context-aware reasoning applications — github.com. Retrieved April 20, 2025 from https://github.com/langchain-ai/langchain
[23] LangChain. 2025. GitHub - langchain-ai/langgraph: Build resilient language agents as graphs. — github.com. Retrieved April 20, 2025 from https://github.com/langchain-ai/langgraph
[24] B. Y. Lin, R. Le Bras, K. Richardson, A. Sabharwal, R. Poovendran, P. Clark, and Y. Choi. 2025. Zebralogic: on the scaling limits of LLMs for logical reasoning. In Forty-second International Conference on Machine Learning.
[25] L. Marchezan, E. Rodrigues, W. K. G. Assunção, M. Bernardino, F. P. Basso, and J. Carbonell. 2022. Software product line scoping: a systematic literature review. In Proceedings of the 26th ACM International Systems and Software Product Line Conference - Volume A (SPLC ’22). ACM, Graz, Austria, 256. doi:10.1145/3546932.3547012
[26] K. Michailidis, D. Tsouros, and T. Guns. 2024. Constraint Modelling with LLMs Using In-Context Learning. In 30th International Conference on Principles and Practice of Constraint Programming (CP 2024) (Leibniz International Proceedings in Informatics (LIPIcs)). Vol. 307. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 20:1–20:27. doi:10.4230/lipics.cp.2024.20
[27] 2025. OpenAI models. Retrieved September 29, 2025 from https://platform.openai.com/docs/models
[28] 2025. Llama 4 Scout - API, Providers, Stats — openrouter.ai. Retrieved September 29, 2025 from https://openrouter.ai/meta-llama/llama-4-scout
[29] 2025. Llama 4 Maverick - API, Providers, Stats — openrouter.ai. Retrieved September 29, 2025 from https://openrouter.ai/meta-llama/llama-4-maverick
[30] L. Pan, V. Ganesh, J. Abernethy, C. Esposo, and W. Lee. 2025. Can transformers reason logically? a study in SAT solving. In Forty-second International Conference on Machine Learning.
[31] M. Parmar, N. Patel, N. Varshney, M. Nakamura, M. Luo, S. Mashetty, A. Mitra, and C. Baral. 2024. Logicbench: towards systematic evaluation of logical reasoning ability of large language models. In ACL (1), 13679–13707. https://doi.org/10.18653/v1/2024.acl-long.739
[32] K. Pohl, G. Böckle, and F. J. van der Linden. 2010. Software Product Line Engineering: Foundations, Principles and Techniques. (1st ed.). Springer Publishing Company, Incorporated.
[33] D. Romero-Organvidez, J. A. Galindo, C. Sundermann, J.-M. Horcas, and D. Benavides. 2024. Uvlhub: a feature model data repository using uvl and open science principles. Journal of Systems and Software, 216, 112150. doi:10.1016/j.jss.2024.112150
[34, 35] C. Sundermann, V. F. Brancaccio, E. Kuiter, S. Krieter, T. Heß, and T. Thüm. 2024. Collecting feature models from the literature: a comprehensive dataset for benchmarking. In Proceedings of the 28th ACM International Systems and Software Product Line Conference (SPLC ’24). ACM, Dommeldange, Luxembourg, 54–65. doi:10.1145/3646548.3672590
[36] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, and D. Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems. S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, (Eds.) Vol. 35. Curran Associates, Inc., 24824–24837.
[37] XAI. 2025. Models and Pricing | xAI Docs — docs.x.ai. Retrieved September 29, 2025 from https://docs.x.ai/docs/models
[38] J. Yan, C. Wang, J. Huang, and W. Zhang. 2024. Do large language models understand logic or just mimick context? CoRR, abs/2402.12091. https://doi.org/10.48550/arXiv.2402.12091
