pith. machine review for the scientific record.

arxiv: 2602.17697 · v2 · submitted 2026-02-06 · 💻 cs.LG · cs.SE

Recognition: no theorem link

Pimp My LLM: Leveraging Variability Modeling to Tune Inference Hyperparameters

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 06:42 UTC · model grok-4.3

classification 💻 cs.LG cs.SE
keywords LLM inference · variability modeling · hyperparameter tuning · energy efficiency · predictive models · configuration management · trade-off analysis

The pith

A feature-based variability model captures LLM inference hyperparameter interactions to predict energy, latency, and accuracy from limited measurements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies variability management techniques from software engineering to the large configuration space of LLM inference. It represents generation hyperparameters and their constraints as a feature-based variability model, samples representative configurations from the Hugging Face Transformers library, and measures their energy consumption, latency, and accuracy. Predictive models are then learned from this data to analyze effects, interactions, and trade-offs. This matters because combinatorial explosion makes exhaustive evaluation infeasible; by enabling predictions from a few measurements, the method supports efficient and sustainable configuration of LLMs.
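
As a concrete illustration, the workflow can be sketched in a few lines of Python: a toy configuration space over Hugging Face Transformers generate() hyperparameters, a constraint filter standing in for the feature model, random sampling, a stubbed measurement in place of the paper's instrumented runs, and a regressor fit on the few measured points. The hyperparameter names follow the Transformers API; the constraint, the measurement stub, and the choice of random forest are illustrative assumptions, not the paper's exact setup.

    import itertools
    import random

    from sklearn.ensemble import RandomForestRegressor

    # Toy configuration space over generate() hyperparameters (Transformers API names).
    SPACE = {
        "do_sample":   [False, True],
        "temperature": [0.2, 0.7, 1.0],
        "top_p":       [0.5, 0.9, 1.0],
        "num_beams":   [1, 2, 4],
    }

    def valid(cfg):
        # Illustrative cross-tree constraint: temperature and top_p only take
        # effect when do_sample=True, so other combinations are pruned as invalid.
        return cfg["do_sample"] or (cfg["temperature"] == 1.0 and cfg["top_p"] == 1.0)

    def measure_energy(cfg):
        # Hypothetical stand-in for an instrumented inference run (returns joules).
        return 1.0 + 0.4 * cfg["num_beams"] + (0.3 if cfg["do_sample"] else 0.0)

    def encode(cfg):
        return [int(cfg["do_sample"]), cfg["temperature"], cfg["top_p"], cfg["num_beams"]]

    configs = [dict(zip(SPACE, vals)) for vals in itertools.product(*SPACE.values())]
    configs = [c for c in configs if valid(c)]   # the feature model prunes invalid points

    sampled = random.sample(configs, k=8)        # measure only a handful of configurations
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit([encode(c) for c in sampled], [measure_energy(c) for c in sampled])

    predictions = model.predict([encode(c) for c in configs])  # predict the rest of the space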

Core claim

Treating LLMs as configurable systems, a feature-based variability model is used to represent generation hyperparameters and constraints. Representative configurations are sampled, their energy consumption, latency, and accuracy are measured, and predictive models are learned that enable systematic analysis of hyperparameter effects and interactions, reveal trade-offs, and support prediction of inference behavior from a limited number of measurements.

What carries the argument

Feature-based variability model that represents generation hyperparameters and constraints for sampling and predictive modeling of inference metrics.
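
To make the formalism concrete, here is a minimal sketch of a feature model as a propositional constraint over Boolean features, in the spirit of feature-models-as-formulas [9]. The feature names and constraints are invented for illustration; the paper's actual model of Transformers is shown in Figure 2.

    from itertools import product

    # Toy feature model: exactly one decoding strategy (an alternative group),
    # plus two optional sampling knobs with cross-tree 'requires' constraints.
    FEATURES = ["greedy", "sampling", "beam", "temperature", "top_p"]

    def is_valid(cfg):
        one_strategy = cfg["greedy"] + cfg["sampling"] + cfg["beam"] == 1
        cross_tree = ((not cfg["temperature"] or cfg["sampling"])   # temperature requires sampling
                      and (not cfg["top_p"] or cfg["sampling"]))    # top_p requires sampling
        return one_strategy and cross_tree

    all_cfgs = (dict(zip(FEATURES, bits)) for bits in product([False, True], repeat=len(FEATURES)))
    valid_cfgs = [cfg for cfg in all_cfgs if is_valid(cfg)]
    print(f"{len(valid_cfgs)} valid configurations out of {2 ** len(FEATURES)}")  # 6 out of 32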

If this is right

  • Systematic analysis of hyperparameter effects and interactions becomes feasible without exhaustive testing.
  • Trade-offs between energy efficiency, latency, and accuracy can be identified and managed (a minimal Pareto-front sketch follows this list).
  • Inference behavior for new configurations can be predicted from a small number of measurements.
  • This supports more sustainable LLM deployment by optimizing settings efficiently.
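
The trade-off point admits a direct construction: a minimal Pareto-front computation over measured (energy, latency, accuracy) triples, assuming lower energy and latency and higher accuracy are preferred. The configurations and numbers below are invented, not the paper's data.

    def dominates(a, b):
        # a dominates b: no worse on every metric and strictly better on at least one
        # (energy and latency: lower is better; accuracy: higher is better).
        no_worse = (a["energy"] <= b["energy"] and a["latency"] <= b["latency"]
                    and a["accuracy"] >= b["accuracy"])
        strictly = (a["energy"] < b["energy"] or a["latency"] < b["latency"]
                    or a["accuracy"] > b["accuracy"])
        return no_worse and strictly

    def pareto_front(points):
        return [p for p in points if not any(dominates(q, p) for q in points)]

    measured = [
        {"config": "greedy",     "energy": 1.2, "latency": 0.8, "accuracy": 0.61},
        {"config": "beam-4",     "energy": 2.5, "latency": 1.1, "accuracy": 0.74},
        {"config": "sample-0.9", "energy": 2.6, "latency": 1.3, "accuracy": 0.70},
    ]
    print([p["config"] for p in pareto_front(measured)])  # ['greedy', 'beam-4']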

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The technique could extend to other resource-heavy ML tasks such as training or non-LLM inference to manage their configuration complexity.
  • It might combine with automated search methods to reduce reliance on manual feature modeling.
  • Hardware changes would likely require retraining the predictive models to keep forecasts accurate.
  • Runtime adaptation of settings could use the predictions to respond to varying workloads.

Load-bearing premise

A feature-based variability model can accurately capture all relevant constraints and interactions among generation hyperparameters without missing important real-world behaviors or requiring excessive manual effort.

What would settle it

Measuring energy, latency, and accuracy on a new set of configurations outside the sampled ones and finding large prediction errors would show that the models fail to generalize.
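
A minimal sketch of that test, assuming encoded configurations X and a measured metric y (synthetic here): fit a predictor on a sampled subset, then report MAE and R² on configurations the model never saw. The paper does not name its predictor family; a random forest is used below as a common choice for configurable-system performance models [11, 53].

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error, r2_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(60, 4))   # stand-in for 60 encoded configurations
    y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=60)  # synthetic metric

    # Fit on the "sampled" configurations; hold out the rest as unseen points.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
    pred = model.predict(X_test)

    print(f"MAE = {mean_absolute_error(y_test, pred):.3f}, R^2 = {r2_score(y_test, pred):.3f}")
    # Large MAE / low R^2 on the held-out configurations would falsify generalization.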

Figures

Figures reproduced from arXiv: 2602.17697 by Clément Quinton, Nada Zine, Romain Rouvoy.

Figure 1
Figure 1: Overview of the proposed approach. The paper represents LLM inference variability using feature models (FMs), the most widely adopted formalism for modeling configurable software systems [9, 20, 46]; an FM captures variability in terms of features, which can be mandatory, optional, or organized into feature groups, and expresses dependencies through cross-tree constraints.
Figure 2
Figure 2: The variability of Hugging Face Transformers explored in this paper. Cross-tree constraints are partially shown.
Figure 3
Figure 3: Cumulative distribution function (CDF) of energy consumption, performance, and accuracy across different configurations for each evaluated LLM.
Figure 4
Figure 4: Feature-wise analysis of energy consumption per …
Figure 5
Figure 5: Feature-wise analysis of energy consumption and accuracy across decoding implementations.
Figure 6
Figure 6: Pareto front illustrating the trade-offs between en…
Original abstract

Large Language Models (LLMs) are being increasingly used across a wide range of tasks. However, their substantial computational demands raise concerns about the energy efficiency and sustainability of both training and inference. Inference, in particular, dominates total compute usage, making its optimization crucial. Recent research has explored optimization techniques and analyzed how configuration choices influence energy consumption. Yet, the vast configuration space of inference servers makes exhaustive empirical evaluation infeasible due to combinatorial explosion. In this paper, we introduce a new perspective on this problem by treating LLMs as configurable systems and applying variability management techniques to systematically analyze inference-time configuration choices. We evaluate our approach on the Hugging Face Transformers library by representing generation hyperparameters and their constraints using a feature-based variability model, sampling representative configurations, measuring their energy consumption, latency, and accuracy, and learning predictive models from the collected data. Our results show that variability modeling effectively manages the complexity of LLM inference configurations. It enables systematic analysis of hyperparameter effects and interactions, reveals trade-offs, and supports prediction of inference behavior from a limited number of measurements. Overall, this work opens a new research direction that bridges software engineering and machine learning by leveraging variability modeling for the efficient and sustainable configuration of LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes modeling LLM inference hyperparameters (e.g., in Hugging Face Transformers) as a feature-based variability model to tame combinatorial explosion. Configurations are sampled, their energy consumption, latency, and accuracy are measured, and predictive models are trained on the resulting data to forecast inference behavior from a limited number of measurements. The central claim is that this variability-management approach enables systematic analysis of hyperparameter effects and interactions, reveals trade-offs, and supports accurate prediction without exhaustive evaluation.

Significance. If the predictive models generalize with low error on unsampled points, the work would usefully import variability-modeling techniques from software engineering into LLM configuration, offering a structured way to expose energy-accuracy-latency trade-offs and reduce measurement cost in large hyperparameter spaces.

major comments (2)
  1. [Abstract] The claim that the method 'supports prediction of inference behavior from a limited number of measurements' is unsupported by any quantitative evidence. No cross-validation error, test-set MAE, R², or held-out generalization results are reported for the learned predictors, nor is the sampling strategy or model family described. This is load-bearing for the central contribution.
  2. [Abstract] The manuscript provides no validation that the feature model captures all relevant interactions (e.g., temperature-top-p coupling on diversity metrics) or that the sampled points adequately cover high-variance regions of the configuration space; without such checks the 'limited measurements' benefit cannot be assessed.
minor comments (1)
  1. [Abstract] The abstract states that 'variability modeling effectively manages the complexity' but does not define the concrete feature-model notation, constraint language, or sampling algorithm used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below, agreeing where additional evidence is needed and outlining the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract] The claim that the method 'supports prediction of inference behavior from a limited number of measurements' is unsupported by any quantitative evidence. No cross-validation error, test-set MAE, R², or held-out generalization results are reported for the learned predictors, nor is the sampling strategy or model family described. This is load-bearing for the central contribution.

    Authors: We agree that the abstract's claim regarding prediction from limited measurements requires quantitative backing to be fully supported. The manuscript details a feature-model-based sampling approach (using combinatorial interaction testing to select representative configurations) and trains regression-based predictive models on the resulting energy, latency, and accuracy measurements. To strengthen this, we will revise the abstract to report key metrics such as cross-validation error, test-set MAE, and R² values from our held-out evaluations, along with a brief description of the model family and sampling strategy. This directly addresses the load-bearing concern without changing the core approach. revision: yes

  2. Referee: [Abstract] The manuscript provides no validation that the feature model captures all relevant interactions (e.g., temperature-top-p coupling on diversity metrics) or that the sampled points adequately cover high-variance regions of the configuration space; without such checks the 'limited measurements' benefit cannot be assessed.

    Authors: The feature model was derived from the Hugging Face Transformers API documentation and includes documented dependencies and interactions, such as the coupling between temperature and top-p that influences output diversity. Sampling employed variability-model techniques to ensure pairwise coverage of these interactions. We acknowledge that explicit post-sampling validation (e.g., quantitative checks on diversity metric variance or coverage of high-variance regions) is not detailed in the current version. We will add this analysis in the revised manuscript, including coverage statistics and interaction-effect measurements, to substantiate the benefit of limited measurements. revision: yes
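
The pairwise-coverage claim is easy to make concrete. Below is a minimal greedy 2-wise selection in Python: repeatedly pick the configuration that covers the most still-uncovered parameter-value pairs until every pair is covered. This is a generic covering-array heuristic in the spirit of [36, 38], not necessarily the manuscript's exact sampler.

    import itertools

    def pairwise_sample(configs):
        # Greedy 2-wise selection: a small subset of `configs` whose parameter-value
        # pairs jointly cover every pair occurring anywhere in `configs`.
        def pairs(cfg):
            return set(itertools.combinations(sorted(cfg.items()), 2))

        uncovered = set().union(*(pairs(c) for c in configs))
        chosen = []
        while uncovered:
            best = max(configs, key=lambda c: len(pairs(c) & uncovered))
            chosen.append(best)
            uncovered -= pairs(best)
        return chosen

    # Example: 8 valid combinations of three toy hyperparameters.
    space = [{"do_sample": s, "num_beams": b, "top_p": p}
             for s in (False, True) for b in (1, 4) for p in (0.5, 1.0)]
    picked = pairwise_sample(space)
    print(f"{len(picked)} of {len(space)} configurations cover all pairs")  # 4 of 8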

Circularity Check

0 steps flagged

No circularity: empirical sampling and modeling from measured data

Full rationale

The paper treats LLM inference as a configurable system, builds a feature-based variability model, samples configurations, measures energy/latency/accuracy, and trains predictive models on the resulting data. No equations, derivations, or self-citations reduce any claimed prediction to a quantity defined by the paper's own fitted parameters or inputs. The central results rest on external measurements and standard ML training rather than self-referential construction. This is a standard empirical SE/ML workflow with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLM inference hyperparameters form a constrained configuration space that can be faithfully represented by a feature model and that limited samples suffice for useful prediction; no new entities are postulated.

axioms (1)
  • domain assumption LLM generation hyperparameters and their constraints can be represented using a feature-based variability model
    Invoked when the paper states it represents generation hyperparameters and constraints with a feature model.

pith-pipeline@v0.9.0 · 5515 in / 1216 out tokens · 23256 ms · 2026-05-16T06:42:18.094329+00:00 · methodology


Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · 5 internal anchors

  1. [1]

Mathieu Acher, José Galindo Duarte, and Jean-Marc Jézéquel. 2023. On programming variability with large language model-based assistant. In Proceedings of the 27th International Systems and Software Product Line Conference - Vol. A. 8–14

  2. [2]

Mathieu Acher and Jabier Martinez. 2023. Generative AI for reengineering variants into software product lines: an experience report. In Proceedings of the 27th International Systems and Software Product Line Conference - Vol. B. 57–66

  3. [3]

Halimeh Agh, Aidin Azamnouri, and Stefan Wagner. 2024. Software product line testing: a systematic literature review. Empirical Software Engineering 29, 6 (2024), 146

  4. [4]

Mustafa Al-Hajjaji, Jens Meinicke, Sebastian Krieter, Reimar Schröter, Thomas Thüm, Thomas Leich, and Gunter Saake. 2016. Tool demo: testing configurable systems with FeatureIDE. In Proceedings of the 2016 ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences. 173–177

  5. [5]

Negar Alizadeh, Boris Belchev, Nishant Saurabh, Patricia Kelbert, and Fernando Castor. 2025. Language Models in Software Development Tasks: An Experimental Analysis of Energy and Accuracy. In 2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR). IEEE, 725–736

  6. [6]

    Juliana Alves Pereira, Mathieu Acher, Hugo Martin, and Jean-Marc Jézéquel

  7. [7]

Sampling effect on performance prediction of configurable systems: A case study. In Proceedings of the ACM/SPEC International Conference on Performance Engineering. 277–288

  8. [8]

Esteban Garces Arias, Meimingwei Li, Christian Heumann, and Matthias Aßenmacher. 2025. Decoding decoded: Understanding hyperparameter effects in open-ended text generation. In Proceedings of the 31st International Conference on Computational Linguistics. 9992–10020

  9. [9]

    Don Batory. 2005. Feature models, grammars, and propositional formulas. In International Conference on Software Product Lines. Springer, 7–20

  10. [10]

Thorsten Berger, Ralf Rublack, Divya Nair, Joanne M Atlee, Martin Becker, Krzysztof Czarnecki, and Andrzej Wasowski. 2013. A survey of variability modeling in industrial practice. In Proceedings of the 7th International Workshop on Variability Modelling of Software-intensive Systems. 1–8

  11. [11]

Leo Breiman. 2001. Random forests. Machine Learning 45, 1 (2001), 5–32

  12. [12]

Davide Brugali. 2020. Software product line engineering for robotics. Software Engineering for Robotics (2020), 1–28

  13. [13]

Cécile Camillieri, Luca Parisi, Mireille Blay-Fornarino, Frédéric Precioso, Michel Riveill, and Joël Cancela-Vaz. 2016. Towards a software product line for machine learning workflows: Focus on supporting evolution. In 10th Workshop on Models and Evolution co-located with ACM/IEEE 19th International Conference on Model Driven Engineering Languages and Systems

  14. [14]

Francisco Caravaca, Ángel Cuevas, and Rubén Cuevas. 2025. From Prompts to Power: Measuring the Energy Footprint of LLM Inference. arXiv preprint arXiv:2511.05597 (2025)

  15. [15]

Lianping Chen, Muhammad Ali Babar, and Nour Ali. 2009. Variability management in software product lines: a systematic review. (2009)

  16. [16]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)

  17. [17]

Davide Chicco, Matthijs J Warrens, and Giuseppe Jurman. 2021. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Computer Science 7 (2021), e623

  18. [18]

Tristan Coignion, Clément Quinton, and Romain Rouvoy. 2024. Green My LLM: Studying the key factors affecting the energy consumption of code assistants. arXiv preprint arXiv:2411.11892 (2024)

  19. [19]

Tristan Coignion, Clément Quinton, and Romain Rouvoy. 2025. When Faster Isn’t Greener: The Hidden Costs of LLM-Based Code Optimization. In ASE ’25 - 40th International Conference on Automated Software Engineering

  20. [20]

Marco Couto, Paulo Borba, Jácome Cunha, João Paulo Fernandes, Rui Pereira, and João Saraiva. 2017. Products go green: Worst-case energy consumption in software product lines. In Proceedings of the 21st International Systems and Software Product Line Conference - Volume A. 84–93

  21. [21]

Krzysztof Czarnecki, Paul Grünbacher, Rick Rabiser, Klaus Schmid, and Andrzej Wasowski. 2012. Cool features and tough decisions: a comparison of variability modeling approaches. In Proceedings of the 6th International Workshop on Variability Modeling of Software-intensive Systems. 173–182

  22. [22]

Krzysztof Czarnecki and Andrzej Wasowski. 2007. Feature diagrams and logics: There and back again. In 11th International Software Product Line Conference (SPLC 2007). IEEE, 23–34

  23. [23]

Jared Fernandez, Clara Na, Vashisth Tiwari, Yonatan Bisk, Sasha Luccioni, and Emma Strubell. 2025. Energy considerations of large language model inference and efficiency optimizations. arXiv preprint arXiv:2504.17674 (2025)

  24. [24]

Zhenxiao Fu, Fan Chen, Shan Zhou, Haitong Li, and Lei Jiang. 2025. LLMCO2: Advancing accurate carbon footprint prediction for LLM inferences. ACM SIGENERGY Energy Informatics Review 5, 2 (2025), 63–68

  25. [25]

José A Galindo, David Benavides, Pablo Trinidad, Antonio-Manuel Gutiérrez-Fernández, and Antonio Ruiz-Cortés. 2019. Automated analysis of feature models: Quo vadis? Computing 101, 5 (2019), 387–433

  26. [26]

José A Galindo, Antonio J Dominguez, Jules White, and David Benavides. 2023. Large language models to generate meaningful feature model instances. In Proceedings of the 27th ACM International Systems and Software Product Line Conference - Volume A. 15–26

  27. [27]

Javad Ghofrani, Ehsan Kozegar, Anna Lena Fehlhaber, and Mohammad Divband Soorati. 2019. Applying product line engineering concepts to deep neural networks. In Proceedings of the 23rd International Systems and Software Product Line Conference - Volume A. 72–77

  28. [28]

Marcos Gomez-Vazquez and Jordi Cabot. 2024. Exploring the use of software product lines for the combination of machine learning models. In Proceedings of the 28th ACM International Systems and Software Product Line Conference. 26–29

  29. [29]

Édouard Guégain, Alexandre Bonvoisin, Mathieu Acher, Clément Quinton, and Romain Rouvoy. 2025. Exploring Performance of Configurable Software Systems: the JHipster Case Study. In EASE ’25 - 29th International Conference on Evaluation and Assessment in Software Engineering

  30. [30]

Édouard Guégain, Clément Quinton, and Romain Rouvoy. 2021. On reducing the energy consumption of software product lines. In Proceedings of the 25th ACM International Systems and Software Product Line Conference - Volume A. 89–99

  31. [31]

Jianmei Guo, Dingyu Yang, Norbert Siegmund, Sven Apel, Atrisha Sarkar, Pavel Valov, Krzysztof Czarnecki, Andrzej Wasowski, and Huiqun Yu. 2018. Data-efficient performance learning for configurable systems. Empirical Software Engineering 23, 3 (2018), 1826–1867

  32. [32]

Erik Johannes Husom, Arda Goknil, Merve Astekin, Lwin Khin Shar, Andre Kåsen, Sagar Sen, Benedikt Andreas Mithassel, and Ahmet Soylu. 2025. Sustainable LLM inference for edge AI: Evaluating quantized LLMs for energy efficiency, output accuracy, and inference latency. arXiv preprint arXiv:2504.03360 (2025)

  33. [33]

Maxime Huyghe, Clément Quinton, and Walter Rudametkin. 2024. Taming the Variability of Browser Fingerprints. In Proceedings of the 28th ACM International Systems and Software Product Line Conference (SPLC ’24). 66–71

  34. [34]

Maxime Huyghe, Walter Rudametkin, and Clément Quinton. 2025. FP-Rainbow: Fingerprint-Based Browser Configuration Identification. In Proceedings of the ACM on Web Conference 2025. 4325–4335

  35. [35]

Akshaya Jagannadharao, Nicole Beckage, Sovan Biswas, Hilary Egan, Jamil Gafur, Thijs Metsch, Dawn Nafus, Giuseppe Raffa, and Charles Tripp. 2024. A Beginner’s Guide to Power and Energy Measurement and Estimation for Computing and Machine Learning. arXiv preprint arXiv:2412.17830 (2024)

  36. [36]

Martin Fagereng Johansen, Øystein Haugen, and Franck Fleurey. 2012. An algorithm for generating t-wise covering arrays from large feature models. In Proceedings of the 16th International Software Product Line Conference - Volume 1. 46–55

  37. [37]

Zohra Kaouter Kebaili, Djamel Eddine Khelladi, Mathieu Acher, and Olivier Barais. 2024. An Empirical Study on Leveraging LLMs for Metamodels and Code Co-evolution. In European Conference on Modelling Foundations and Applications (ECMFA 2024), Vol. 23. Journal of Object Technology, 1–14

  38. [38]

Sebastian Krieter, Thomas Thüm, Sandro Schulze, Gunter Saake, and Thomas Leich. 2020. YASA: yet another sampling algorithm. In Proceedings of the 14th International Working Conference on Variability Modelling of Software-Intensive Systems. 1–10

  39. [39]

Elias Kuiter, Chico Sundermann, Thomas Thüm, Tobias Hess, Sebastian Krieter, and Gunter Saake. 2025. How configurable is the Linux Kernel? Analyzing two decades of feature-model history. ACM Transactions on Software Engineering and Methodology (2025)

  40. [40]

Indika Kumara, Mohamed Hameez Ariz, Mohan Baruwal Chhetri, Majid Mohammadi, Willem-Jan Van Den Heuvel, and Damian A Tamburri. 2022. FOCloud: feature model guided performance prediction and explanation for deployment configurable cloud applications. IEEE Transactions on Services Computing 16, 1 (2022), 302–314

  41. [41]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

  42. [42]

Malgorzata Lazuka, Andreea Anghel, and Thomas Parnell. 2024. LLM-Pilot: Characterize and optimize performance of your LLM inference services. In SC24: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–18

  43. [43]

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems 36 (2023), 21558–21572

  44. [44]

Noelia Lopez-Duran, David Romero Organvídez, Fermín Cruz, and David Benavides. 2025. Configuration Bugs Classification using LLMs and Encoders. In Proceedings of the 2025 29th ACM International Systems and Software Product Line Conference - Volume A. 190–200

  45. [45]

Sasha Luccioni, Yacine Jernite, and Emma Strubell. 2024. Power hungry processing: Watts driving the cost of AI deployment? In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency. 85–99

  46. [46]

Paul Joe Maliakel, Shashikant Ilager, and Ivona Brandic. 2025. Investigating Energy Efficiency and Performance Trade-offs in LLM Inference Across Tasks and DVFS Settings. arXiv preprint arXiv:2501.08219 (2025)

  47. [47]

    Maíra Marques, Jocelyn Simmonds, Pedro O Rossel, and María Cecilia Bastarrica

  48. [48]

Software product line evolution: A systematic literature review. Information and Software Technology 105 (2019), 190–208

  49. [49]

Matias Martinez. 2025. The impact of hyperparameters on large language model inference performance: An evaluation of vLLM and HuggingFace pipelines. In Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering. 1672–1678

  50. [50]

Alireza Nik, Michael A Riegler, and Pål Halvorsen. 2025. Impact of decoding strategies on GPU energy usage in large language model text generation. Scientific Reports (2025)

  51. [51]

Sihyeong Park, Sungryeol Jeon, Chaelyn Lee, Seokhun Jeon, Byung-Soo Kim, and Jemin Lee. 2025. A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency. arXiv preprint arXiv:2505.01658 (2025)

  52. [52]

Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Brijesh Warrier, Nithish Mahalingam, and Ricardo Bianchini. 2024. Characterizing power management opportunities for LLMs in the cloud. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. 207–222

  53. [53]

Juliana Alves Pereira, Mathieu Acher, Hugo Martin, Jean-Marc Jézéquel, Goetz Botterweck, and Anthony Ventresque. 2021. Learning software configuration spaces: A systematic literature review. Journal of Systems and Software 182 (2021), 111044

  54. [54]

Clément Quinton, Daniel Romero, and Laurence Duchien. 2016. SALOON: a platform for selecting and configuring cloud environments. Software: Practice and Experience 46, 1 (2016), 55–78

  55. [55]

Siddharth Samsi, Dan Zhao, Joseph McDonald, Baolin Li, Adam Michaleas, Michael Jones, William Bergeron, Jeremy Kepner, Devesh Tiwari, and Vijay Gadepally. 2023. From words to watts: Benchmarking the energy costs of large language model inference. In 2023 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1–9

  56. [56]

Ina Schaefer, Rick Rabiser, Dave Clarke, Lorenzo Bettini, David Benavides, Goetz Botterweck, Animesh Pathak, Salvador Trujillo, and Karina Villela. 2012. Software diversity: state of the art and perspectives. International Journal on Software Tools for Technology Transfer 14 (2012), 477–495

  57. [57]

Chufan Shi, Haoran Yang, Deng Cai, Zhisong Zhang, Yifan Wang, Yujiu Yang, and Wai Lam. 2024. A thorough examination of decoding methods in the era of LLMs. URL https://arxiv.org/abs/2402.06925

  58. [58]

    Jovan Stojkovic, Esha Choukse, Chaojie Zhang, Inigo Goiri, and Josep Torrellas

  59. [59]

Towards greener LLMs: Bringing energy-efficiency to the forefront of LLM inference. arXiv preprint arXiv:2403.20306 (2024)

  60. [60]

    Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, and Esha Choukse

  61. [61]

DynamoLLM: Designing LLM inference clusters for performance and energy efficiency. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 1348–1362

  62. [62]

Johannes Stümpfle, Devansh Atray, Nasser Jazdi, and Michael Weyrich. 2025. Large language model assisted transformation of software variants into a software product line. In 2025 IEEE/ACM 22nd International Conference on Software and Systems Reuse (ICSR). IEEE, 12–20

  63. [63]

Johannes Stümpfle, Sebastian Baum, Daniel Dittler, Nasser Jazdi, and Michael Weyrich. 2024. Automating Software Product Line Adoption Based on Feature Models Using Large Language Models. In 2024 IEEE 29th International Conference on Emerging Technologies and Factory Automation (ETFA). IEEE, 1–4

  64. [64]

Paul Temple, Mathieu Acher, Jean-Marc Jézéquel, Léo Noel-Baron, and José A Galindo. 2017. Learning-based performance specialization of configurable systems. Ph.D. Dissertation. IRISA, Inria Rennes; University of Rennes 1

  65. [65]

Guadalupe-Isaura Trujillo-Tzanahua, Ulises Juárez-Martínez, Alberto-Alfonso Aguilar-Lasserre, María-Karen Cortés-Verdín, and Catherine Azzaro-Pantel. 2020. Multiple software product lines to configure applications of internet of things. IET Software 14, 2 (2020), 165–175

  66. [66]

Grant Wilkins, Srinivasan Keshav, and Richard Mortier. 2024. Offline energy-optimal LLM serving: Workload-based energy models for LLM inference on heterogeneous systems. ACM SIGENERGY Energy Informatics Review 4, 5 (2024), 113–119

  67. [67]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019)

  68. [68]

Yanran Wu, Inez Hua, and Yi Ding. 2025. Unveiling environmental impacts of large language model serving: A functional unit view. arXiv preprint arXiv:2502.11256 (2025)

  69. [69]

Dan Zhao, Nathan C Frey, Joseph McDonald, Matthew Hubbell, David Bestor, Michael Jones, Andrew Prout, Vijay Gadepally, and Siddharth Samsi. 2022. A green(er) world for AI. In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 742–750

  70. [70]

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223 1, 2 (2023)

  71. [71]

Zibin Zheng, Kaiwen Ning, Qingyuan Zhong, Jiachi Chen, Wenqing Chen, Lianghong Guo, Weicheng Wang, and Yanlin Wang. 2025. Towards an understanding of large language models in software engineering tasks. Empirical Software Engineering 30, 2 (2025), 50

  72. [72]

Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, et al. 2024. A survey on efficient inference for large language models. arXiv preprint arXiv:2404.14294 (2024)

  73. [73]

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al.

  74. [74]

BigCodeBench: Benchmarking code generation with diverse function calls and complex instructions. Preprint arXiv:2406.15877 (2024)

  75. [75]

Nada Zine, Clément Quinton, and Romain Rouvoy. 2025. LLM-based Co-Evolution of Configurable Software Systems. In Proceedings of the 2025 29th ACM International Systems and Software Product Line Conference - Volume A. 27–38