An Empirical Study of Sustainability in Prompt-driven Test Script Generation Using Small Language Models
Pith reviewed 2026-05-13 20:18 UTC · model grok-4.3
The pith
Small language models for prompt-driven unit test generation display distinct profiles trading energy use and speed against coverage and stability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The study empirically examines the environmental and performance tradeoffs of SLMs using the HumanEval benchmark and adaptive prompt variants. The analysis uses CodeCarbon to characterize energy consumption, carbon emissions and duration under controlled conditions, with unit-test script coverage serving as an initial proxy for generated test quality. Our results show that different SLMs exhibit distinct sustainability profiles - some favor lower energy use and faster execution, while others maintain higher stability or coverage under comparable conditions.
What carries the argument
CodeCarbon tracking of energy, emissions, and duration during prompt-driven test script generation by 2B-8B parameter SLMs on HumanEval, using adaptive Anthropic-style prompts and coverage as quality proxy.
If this is right
- Model selection for test generation can be guided by whether a task prioritizes low energy use or higher coverage stability.
- Prompt structure interacts with model choice to shape both environmental cost and test outcomes.
- SLMs can serve as lower-impact alternatives to larger models for routine automated testing tasks.
Where Pith is reading between the lines
- Energy-aware test generation pipelines could dynamically switch among SLMs based on current sustainability targets.
- The observed profiles suggest value in repeating the measurements on full project repositories rather than isolated benchmarks.
- These tradeoffs could combine with other green software practices to reduce overall emissions in continuous integration.
Load-bearing premise
That coverage of the generated unit tests is a sufficient proxy for their actual quality and that the controlled CodeCarbon measurements generalize to real-world usage.
What would settle it
A follow-up experiment on large real-world codebases that measures actual defect detection rates against energy consumed; results would falsify the claim if coverage shows no reliable link to effectiveness.
Figures
read the original abstract
The increasing use of language models in automated test script generation raises concerns about their environmental impact, yet existing sustainability analyses focus predominantly on large language models. As a result, the energy and carbon characteristics of small language models (SLMs) during prompt-driven unit-test script generation remain largely unexplored. To address this gap, this study empirically examines the environmental and performance tradeoffs of SLMs (in the 2B-8B parameter range) using the HumanEval benchmark and adaptive prompt variants (based on the Anthropic template). The analysis uses CodeCarbon to characterize energy consumption carbon emissions and duration under controlled conditions, with unit-test script coverage serving as an initial proxy for generated test quality. Our results show that different SLMs exhibit distinct sustainability profiles - some favor lower energy use and faster execution, while others maintain higher stability or coverage under comparable conditions. Overall, this work provides focused empirical evidence on sustainable SLM-based test script generation, clarifying how prompt structure and model selection jointly shape environmental and performance outcomes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents an empirical study of sustainability in prompt-driven unit-test script generation using small language models (SLMs in the 2B-8B range). It evaluates energy consumption, carbon emissions, and execution duration via CodeCarbon on the HumanEval benchmark with adaptive prompts derived from the Anthropic template, treating unit-test coverage as an initial proxy for generated test quality. The central claim is that different SLMs exhibit distinct sustainability profiles, with some favoring lower energy use and faster execution while others maintain higher stability or coverage under comparable conditions.
Significance. If the empirical measurements hold under scrutiny, the work is significant for providing focused data on the environmental and performance tradeoffs of SLMs in software engineering tasks, a gap left by prior studies centered on larger models. The controlled experimental setup with CodeCarbon supports reproducibility, and the emphasis on prompt structure and model selection offers practical guidance for sustainable test automation. Strengths include the direct measurement of energy/emissions alongside coverage, which could inform model choice in resource-constrained settings.
major comments (1)
- [Abstract and results] Abstract and results discussion: The headline claim of distinct sustainability profiles (balancing energy, speed, stability, and coverage) depends on unit-test coverage serving as a sufficient proxy for test quality. Coverage counts executed lines on HumanEval but does not verify semantic correctness, assertion strength, or fault-detection power. If models produce scripts with similar coverage yet differing brittleness or non-assertive tests, the reported coverage or stability advantages cannot be interpreted as quality advantages, weakening the joint claim that prompt structure and model choice shape meaningful environmental-performance outcomes.
minor comments (2)
- [Abstract] The abstract states that coverage serves as an 'initial proxy' but provides no error bars, statistical tests, or full methodological details on measurement (e.g., exact CodeCarbon configuration, prompt adaptation steps, or number of runs). Adding these would strengthen the empirical presentation without altering the core claim.
- [Methodology] Minor notation issue: The description of 'adaptive prompt variants' lacks a concrete example or table showing the base Anthropic template versus the adapted versions used for each SLM, which would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which helps us clarify the scope and limitations of our empirical study. We address the major comment below.
read point-by-point responses
-
Referee: [Abstract and results] Abstract and results discussion: The headline claim of distinct sustainability profiles (balancing energy, speed, stability, and coverage) depends on unit-test coverage serving as a sufficient proxy for test quality. Coverage counts executed lines on HumanEval but does not verify semantic correctness, assertion strength, or fault-detection power. If models produce scripts with similar coverage yet differing brittleness or non-assertive tests, the reported coverage or stability advantages cannot be interpreted as quality advantages, weakening the joint claim that prompt structure and model choice shape meaningful environmental-performance outcomes.
Authors: We acknowledge that unit-test coverage is an imperfect and preliminary proxy that does not capture semantic correctness, assertion strength, or fault-detection effectiveness. Our manuscript already describes coverage as an 'initial proxy' and frames the sustainability profiles strictly in terms of the measured trade-offs among energy consumption, carbon emissions, execution duration, and coverage rates (with stability defined as consistency across repeated runs). The central claims concern these observable metrics rather than asserting superior test quality. Nevertheless, we agree that the current wording risks over-interpretation. We will revise the abstract, results discussion, and limitations section to more explicitly state the boundaries of coverage as a quality indicator and to emphasize that the reported profiles reflect environmental and performance characteristics under this proxy, not comprehensive test effectiveness. This revision will be incorporated in the next version of the manuscript. revision: yes
Circularity Check
No circularity: purely empirical measurement study
full rationale
The paper conducts controlled experiments measuring energy use, carbon emissions, runtime, and unit-test coverage for SLMs on HumanEval using CodeCarbon and adaptive prompts. No equations, derivations, fitted parameters, or predictions appear in the provided text or abstract. Results are reported as direct observations of distinct sustainability profiles without any self-referential reduction where outputs are defined by or fitted to the same inputs. Coverage serves as an explicit proxy for quality but is not derived from or equated to the sustainability metrics by construction. This matches the reader's 0.0 assessment; the study is self-contained empirical work with no load-bearing self-citations or ansatzes.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Tom Cappendijk, Pepijn de Reus, and Ana Oprescu. 2025. An exploration of prompting LLMs to generate energy-efficient code. InProceedings of the 47th International Conference on Software Engineering: GREENs Track. ACM, Ottawa, Canada. doi:10.48550/arXiv.2405.05097
work page internal anchor Pith review doi:10.48550/arxiv.2405.05097 2025
- [2]
-
[3]
Gordon Fraser and Andrea Arcuri. 2011. EvoSuite: Automatic test suite generation for object-oriented software. InProceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering. ACM, 416–419
work page 2011
-
[4]
Green Software Foundation. 2021. Green software foundation website.Green Software Foundation(2021). Available online: https://greensoftware.foundation
work page 2021
-
[5]
Vishakh H. Iyer, Ravi Pathak, Achyudh Sridhar, Varsha Apte, Raghuram Gopalakr- ishnan, Deepa Kalathil, and Parthasarathy Ranganathan. 2023. Carbon-aware large language model inference in the cloud.arXiv preprint arXiv:2309.08101 (2023)
-
[6]
Saurabh Kapoor. 2024. Green software quality: A comprehensive framework for sustainable metrics in software development.International Journal of Computer Trends and Technology72, 10 (2024), 113–120. doi:10.14445/22312803/IJCTT- V72I10P118
-
[7]
Patricia Lago, Qing Gu, and Paolo Bozzelli. 2014. A systematic literature re- view of green software metrics.VU University Amsterdam Technical Report (2014). https://research.vu.nl/en/publications/a-systematic-literature-review-of- green-software-metrics
work page 2014
-
[8]
Yuhan Li, Rong Chen, Mayank Gupta, and Dinesh Rao. 2023. Energy consumption patterns in automated test script generation. InProceedings of the 45th Interna- tional Conference on Software Engineering (ICSE). IEEE, 845–856
work page 2023
- [9]
-
[10]
Mourão, Leila Karita, and Ivan do Carmo Machado
Brunna C. Mourão, Leila Karita, and Ivan do Carmo Machado. 2018. Green and sustainable software engineering: A systematic mapping study. InProceedings of the SBQS. ACM, 1–10. doi:10.1145/3275245.3275258
-
[11]
Ana Oprescu, Pepijn Reus, and Tom Cappendijk. 2023. Prompt engineering for energy efficiency: Lessons from code generation. InProceedings of the 2023 International Conference on Green Software. ACM
work page 2023
-
[12]
Hannah Rashkin, Maarten Sap, Maxwell Forbes, and Yejin Choi. 2021. Words to watts: A benchmark for measuring the energy efficiency of NLP models and tasks. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP). 10000–10015
work page 2021
-
[13]
Ritesh Sharma, Neha Batra, and Xin Liu. 2023. Sustainable test automation: Quantifying the environmental impact of code testing with SLMs.Journal of Software Sustainability15, 3 (2023), 225–239
work page 2023
-
[14]
Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, 3645–3650
work page 2019
-
[15]
Tina Vartziotis, Ippolyti Dellatolas, George Dasoulas, Maximilian Schmidt, Florian Schneider, Tim Hoffmann, Sotirios Kotsopoulos, and Michael Keckeisen. 2024. Learn to code sustainably: An empirical study on green code generation. In Proceedings of the 2024 International Workshop on Large Language Models for Code (LLM4Code). ACM, 1–8. doi:10.1145/3643795.3648394
-
[16]
Roberto Verdecchia, Patricia Lago, Christof Ebert, and Carol de Vries. 2021. Green IT and green software.IEEE Software38, 6 (2021), 7–15. doi:10.1109/MS.2021. 3102254
- [17]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.