Green Prompting: Characterizing Prompt-driven Energy Costs of LLM Inference

Daria Smirnova; Hamid Nasiri; Marta Adamska; Peter Garraghan; Zhengxin Yu

arXiv: 2503.10666 · v4 · submitted 2025-03-09 · cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-23 00:03 UTC · model grok-4.3

classification: cs.CL · cs.AI · cs.LG

keywords: LLM inference, energy consumption, prompt design, semantic meaning, energy efficiency, transformer models, question answering, text generation

The pith

Semantic meaning and task-specific keywords affect LLM inference energy more than prompt length.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how prompt and response features change energy use during inference for three open-source transformer LLMs on question answering, sentiment analysis, and text generation. Measurements track prompt length, semantic content, generation time, and energy draw for each run. Results show that the meaning carried by the prompt matters more for energy cost than its length, and that certain keywords reliably track higher or lower energy use in a task-dependent way. These patterns suggest that prompt wording itself can be adjusted to cut inference power without altering the underlying model.

Core claim

Even when given identical tasks, the models generate responses with different characteristics, which in turn yield distinct energy consumption patterns. Prompt length is less significant than the semantic meaning of the task itself. Specific keywords are associated with higher or lower energy usage, and these associations change across the three task types. The findings indicate that prompt design can be used to optimize inference efficiency.

What carries the argument

Empirical measurement of energy consumption tied to prompt length, semantic meaning, and task-linked keywords during LLM inference runs.
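The summary does not say how energy was metered. One common approach is to poll device power at regular intervals during a run and integrate the samples over time. The sketch below illustrates that idea under stated assumptions: the function name `energy_joules` and the sample values are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate sampled power (watts) over time (seconds) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt < 0:
            raise ValueError("timestamps must be non-decreasing")
        # Area of the trapezoid between two consecutive power samples.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


# Hypothetical samples: power polled every 0.5 s during one inference run.
samples_t = [0.0, 0.5, 1.0, 1.5, 2.0]
samples_p = [100.0, 200.0, 200.0, 200.0, 100.0]
run_energy = energy_joules(samples_t, samples_p)
print(f"{run_energy:.1f} J")
```

Because the result scales with run duration, two prompts that draw similar power can still differ in energy if one produces a longer generation.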

If this is right

Prompt wording choices can be used to reduce inference energy without changing the model or hardware.
Task-specific keywords offer a practical signal for predicting or lowering energy draw.
Energy use varies across responses even when the input task is held constant.
Prompt design becomes a controllable lever for lowering the operational cost of LLMs.
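To make the keyword claim concrete, a keyword's association with energy can be expressed as the gap between the mean energy of prompts that contain it and the mean energy of those that do not, computed within a single task type. This is a minimal sketch; the helper `keyword_energy_gap`, the prompts, and the energy values are all hypothetical and do not reproduce the paper's data or method.

```python
from statistics import mean


def keyword_energy_gap(records, keyword):
    """Mean energy of prompts containing `keyword` minus mean energy of the rest.

    `records` is a list of (prompt_text, energy_joules) pairs from one task type.
    """
    kw = keyword.lower()
    with_kw = [e for prompt, e in records if kw in prompt.lower().split()]
    without_kw = [e for prompt, e in records if kw not in prompt.lower().split()]
    if not with_kw or not without_kw:
        raise ValueError("need at least one prompt in each group")
    return mean(with_kw) - mean(without_kw)


# Hypothetical text-generation prompts with made-up per-run energy in joules.
records = [
    ("Explain how vaccines work", 120.0),
    ("Explain the water cycle", 140.0),
    ("List three uses of copper", 30.0),
    ("List five prime numbers", 50.0),
]
gap = keyword_energy_gap(records, "explain")
print(gap)
```

A positive gap says only that the keyword co-occurs with higher energy in this sample; it does not show that the keyword causes the difference.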

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Prompt rewriting tools could be built to suggest lower-energy versions of a given query while preserving meaning.
The same keyword effects might appear in other resource measures such as latency or memory footprint.
Energy-aware prompt templates could be developed per task type for repeated use in production systems.

Load-bearing premise

Measured energy differences can be causally attributed to prompt semantics and keywords rather than hardware variability, measurement noise, or untracked differences in model generation paths.

What would settle it

A replication that fixes hardware, repeats each prompt many times, and applies statistical tests showing no reliable energy difference linked to the reported keywords or semantic categories.
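One way such a replication could test for a reliable difference is a two-sample permutation test on repeated energy measurements. The sketch below is an illustration under assumptions, not the paper's procedure: the function `permutation_p_value` and the measurement values are hypothetical.

```python
import random


def permutation_p_value(group_a, group_b, n_resamples=5000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)

    def mean_gap(xs, ys):
        return abs(sum(xs) / len(xs) - sum(ys) / len(ys))

    observed = mean_gap(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_resamples):
        # Reassign measurements to the two groups at random.
        rng.shuffle(pooled)
        if mean_gap(pooled[:n_a], pooled[n_a:]) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical repeated runs (joules) on fixed hardware.
with_keyword = [52.1, 50.8, 53.0, 51.5, 52.7, 51.9]
without_keyword = [47.9, 48.6, 47.2, 49.0, 48.1, 47.5]
p = permutation_p_value(with_keyword, without_keyword)
print(f"p = {p:.4f}")
```

A large p-value across the reported keywords and semantic categories would count against the claim; a small one would support an association, though not a causal attribution on its own.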

Figures

Figures reproduced from arXiv: 2503.10666 by Daria Smirnova, Hamid Nasiri, Marta Adamska, Peter Garraghan, Zhengxin Yu.

**Figure 2.** Energy consumption measured during inference for each model and task.

**Figure 3.** Scatter plots representing energy usage in relation to …

Original abstract

Large Language Models (LLMs) have become widely used across various domains spanning search engines, code generation, and text creation. However, a major concern associated with their adoption is the high cost of inference, impacting both their sustainability and financial feasibility. In this study, we empirically study how different prompt and response characteristics directly impact LLM inference energy cost. We conduct experiments leveraging three open-source transformer-based LLMs across three task types: question answering, sentiment analysis, and text generation. For each inference, we analyzed prompt and response characteristics (length, semantic meaning, time taken, energy consumption). Our results demonstrate that even when presented with identical tasks, models generate responses with varying characteristics and subsequently exhibit distinct energy consumption patterns. We found that prompt length is less significant than the semantic meaning of the task itself. In addition, we identified specific keywords associated with higher or lower energy usage that vary between associated tasks. These findings highlight the importance of prompt design in optimizing inference efficiency. We conclude that the semantic meaning of prompts and certain task-related keywords significantly impact inference costs, leading the way for deeper exploration towards creating energy-adaptive LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper measures energy across prompts on three models and finds that semantics matter more than length, along with some keyword effects, but it lacks controls that would isolate those effects from response length or generation paths.

The main point is that this work runs energy measurements on three open-source LLMs for question answering, sentiment analysis, and text generation, then reports that prompt length is less important than task semantics and that certain keywords correlate with higher or lower energy use. The experiments track prompt and response length, time, and energy, and note that the same task can produce responses with different characteristics and costs. That matches everyday observation when running these models and gives a small set of concrete observations on a practical issue. The keyword findings are the most specific new detail here. The paper keeps the claims modest and focuses on empirical patterns rather than big theory. It is the kind of targeted measurement that can inform people who actually deploy models and want to watch their power bill.

The soft spot is the missing isolation. The abstract mentions analyzing prompt and response characteristics but gives no sign of fixed output lengths, temperature zero, repeated runs with error bars, or checks that rule out systematic differences in how many tokens the model generates for different keywords. If response length or token probabilities vary with the tested semantics, the energy differences cannot be cleanly attributed to the prompt. That is the exact concern in the stress-test note, and nothing in the provided abstract pushes back against it.

The work is aimed at practitioners who care about inference costs and might try prompt tweaks as a low-effort lever. It will not shift training methods or hardware, but the data could be useful if the controls are added.

I would send it for peer review because the topic is relevant, the experiments use real models, and the central claim is testable once the methods are spelled out. Reviewers will likely ask for the missing controls and statistical detail, which is the right level of scrutiny for this kind of incremental empirical paper.

Referee Report

2 major / 0 minor

Summary. The paper empirically examines the effects of prompt and response characteristics (length, semantic meaning, time, energy) on inference energy costs for three open-source LLMs across question answering, sentiment analysis, and text generation tasks. It concludes that semantic meaning of the task outweighs prompt length in driving energy use and identifies task-specific keywords linked to higher or lower consumption, advocating for prompt design to optimize efficiency.

Significance. If the energy differences can be causally isolated to prompt semantics rather than correlated factors, the work would provide actionable insights for energy-efficient LLM deployment and prompt engineering in sustainability-focused AI research.

major comments (2)

[Abstract] The central claim that 'prompt length is less significant than the semantic meaning of the task itself' and that keywords drive energy differences requires explicit controls (e.g., fixed output length, temperature=0, repeated runs with error bars, hardware normalization) to rule out confounding by response length or generation paths; none are described, undermining causal attribution.
[Abstract, results paragraph] No quantitative results, effect sizes, statistical tests, or error analysis are supplied despite the claim of distinct energy patterns; this leaves the empirical isolation unverified and the findings non-reproducible from the given description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight opportunities to strengthen the clarity and reproducibility of our work. We address each major comment below and commit to revising the abstract accordingly.

Referee: [Abstract] The central claim that 'prompt length is less significant than the semantic meaning of the task itself' and that keywords drive energy differences requires explicit controls (e.g., fixed output length, temperature=0, repeated runs with error bars, hardware normalization) to rule out confounding by response length or generation paths; none are described, undermining causal attribution.

Authors: We agree that the abstract does not explicitly list the experimental controls. The full manuscript specifies temperature=0 for deterministic generation, consistent hardware, and multiple runs per prompt; however, response length was not fixed a priori because variation in generated length is itself an outcome of interest. To strengthen causal claims, we will revise the abstract to explicitly state the controls employed and add a sentence noting that we performed post-hoc analysis controlling for response length. We will also include error bars from repeated runs in the revised abstract where space permits.

Revision: yes

Referee: [Abstract, results paragraph] No quantitative results, effect sizes, statistical tests, or error analysis are supplied despite the claim of distinct energy patterns; this leaves the empirical isolation unverified and the findings non-reproducible from the given description.

Authors: The abstract is a concise summary and therefore omits specific numbers and tests that appear in the results section of the full paper. We acknowledge that this omission reduces immediate verifiability. We will revise the abstract to incorporate representative quantitative findings, including effect sizes for the semantic versus length comparison and mention of statistical significance, while remaining within length constraints.

Revision: yes
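The rebuttal mentions a post-hoc analysis controlling for response length without describing it. One simple form such an analysis could take is to regress energy on the number of generated tokens and then compare the residuals between keyword groups. The sketch below is an assumption about what that might look like, with hypothetical names (`residualize`, `group_gap`) and made-up data in which energy depends only on output length.

```python
def residualize(y, x):
    """Residuals of y after an ordinary least-squares fit y ~ a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must vary for the fit to be defined")
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def group_gap(values, flags):
    """Mean of values where the flag is True minus mean where it is False."""
    on = [v for v, f in zip(values, flags) if f]
    off = [v for v, f in zip(values, flags) if not f]
    return sum(on) / len(on) - sum(off) / len(off)


# Hypothetical runs: energy here is exactly 0.5 J per generated token.
output_tokens = [10, 20, 30, 40]
energy_j = [5.0, 10.0, 15.0, 20.0]
has_keyword = [False, False, True, True]

raw_gap = group_gap(energy_j, has_keyword)
adjusted_gap = group_gap(residualize(energy_j, output_tokens), has_keyword)
print(raw_gap, round(adjusted_gap, 9))
```

In this constructed case the raw gap between keyword groups is large, but it disappears once output length is accounted for, which is the confound the referee describes.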

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivations or self-referential predictions

The paper reports direct experimental measurements of energy consumption across LLMs, tasks, and prompt variations, with results stated as observed patterns (e.g., semantic meaning mattering more than length, and task-specific keywords). No equations, fitted parameters, predictions derived from prior fits, or load-bearing self-citations appear in the provided abstract or described methodology. All claims rest on external instrumentation and data collection rather than any internal reduction to the paper's own inputs, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical measurement study; no free parameters, new entities, or mathematical axioms beyond standard assumptions about energy metering.

axioms (1)

Domain assumption: Energy consumption during LLM inference can be accurately isolated and attributed to prompt characteristics.
Required to interpret measured differences as caused by prompt semantics and keywords.

[1] [1]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems , 2017

work page 2017

[2] [2]

Large language models in medicine,

A. J. Thirunavukarasu, D. S. J. Ting, K. Elangovan, L. Gutierrez, T. F. Tan, and D. S. W. Ting, “Large language models in medicine,” Nature medicine, vol. 29, no. 8, pp. 1930–1940, 2023

work page 1930

[3] [3]

[Online]

Microsoft. [Online]. Available: https://www.bing.com/

work page

[4] [4]

An extensive study on pre-trained models for program understanding and generation,

Z. Zeng, H. Tan, H. Zhang, J. Li, Y . Zhang, and L. Zhang, “An extensive study on pre-trained models for program understanding and generation,” in Proceedings of the 31st ACM SIGSOFT international symposium on software testing and analysis , 2022, pp. 39–51

work page 2022

[5] [5]

Sparks of artificial general intelligence: Early experiments with gpt-4,

S. Bubeck, V . Chadrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y . T. Lee, Y . Li, S. Lundberget al., “Sparks of artificial general intelligence: Early experiments with gpt-4,” 2023

work page 2023

[6] [6]

Energy and policy con- siderations for modern deep learning research,

E. Strubell, A. Ganesh, and A. McCallum, “Energy and policy con- siderations for modern deep learning research,” in Proceedings of the AAAI conference on artificial intelligence , vol. 34, no. 09, 2020, pp. 13 693–13 696

work page 2020

[7] [7]

[Online]

IEA, 2024, licence: CC BY 4.0. [Online]. Available: https://www.iea. org/reports/electricity-2024

work page 2024

[8] [8]

Estimating the carbon footprint of bloom, a 176b parameter language model,

A. S. Luccioni, S. Viguier, and A.-L. Ligozat, “Estimating the carbon footprint of bloom, a 176b parameter language model,” Journal of Machine Learning Research , vol. 24, no. 253, pp. 1–15, 2023

work page 2023

[9] [9]

Sustainable ai: Environmental implications, challenges and opportunities,

C.-J. Wu, R. Raghavendra, U. Gupta, B. Acun, N. Ardalani, K. Maeng, G. Chang, F. Aga, J. Huang, C. Baiet al., “Sustainable ai: Environmental implications, challenges and opportunities,” Proceedings of Machine Learning and Systems , vol. 4, pp. 795–813, 2022

work page 2022

[10] [10]

Hardware approximate techniques for deep neural network accelerators: A survey,

G. Armeniakos, G. Zervakis, D. Soudris, and J. Henkel, “Hardware approximate techniques for deep neural network accelerators: A survey,” ACM Computing Surveys , vol. 55, no. 4, pp. 1–36, 2022

work page 2022

[11] [11]

Npe: An fpga-based overlay processor for natural language processing,

H. Khan, A. Khan, Z. Khan, L. B. Huang, K. Wang, and L. He, “Npe: An fpga-based overlay processor for natural language processing,” arXiv preprint arXiv:2104.06535, 2021

work page arXiv 2021

[12] [12]

Splitwise: Efficient generative llm inference using phase splitting,

P. Patel, E. Choukse, C. Zhang, A. Shah, ´I. Goiri, S. Maleki, and R. Bianchini, “Splitwise: Efficient generative llm inference using phase splitting,” in 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 2024, pp. 118–132

work page 2024

[13] [13]

Taming {Throughput-Latency} tradeoff in {LLM} inference with {Sarathi-Serve},

A. Agrawal, N. Kedia, A. Panwar, J. Mohan, N. Kwatra, B. Gulavani, A. Tumanov, and R. Ramjee, “Taming {Throughput-Latency} tradeoff in {LLM} inference with {Sarathi-Serve},” in 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24) , 2024, pp. 117–134

work page 2024

[14] [14]

Characterization of large language model development in the datacenter,

Q. Hu, Z. Ye, Z. Wang, G. Wang, M. Zhang, Q. Chen, P. Sun, D. Lin, X. Wang, Y . Luo et al. , “Characterization of large language model development in the datacenter,” in 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24) , 2024, pp. 709–729

work page 2024

[15] [15]

Towards greener llms: Bringing energy-efficiency to the forefront of llm infer- ence,

J. Stojkovic, E. Choukse, C. Zhang, I. Goiri, and J. Torrellas, “Towards greener llms: Bringing energy-efficiency to the forefront of llm infer- ence,” arXiv preprint arXiv:2403.20306 , 2024

[16]

Ask me anything: A simple strategy for prompting language models,

S. Arora, A. Narayan, M. F. Chen, L. Orr, N. Guha, K. Bhatia, I. Chami, and C. Re, “Ask me anything: A simple strategy for prompting language models,” in The Eleventh International Conference on Learning Representations, 2022

[17]

Black-box prompt optimization: Aligning large language models without model training,

J. Cheng, X. Liu, K. Zheng, P. Ke, H. Wang, Y. Dong, J. Tang, and M. Huang, “Black-box prompt optimization: Aligning large language models without model training,” arXiv preprint arXiv:2311.04155, 2023

[18]

A survey of prompt engineering methods in large language models for different nlp tasks,

S. Vatsal and H. Dubey, “A survey of prompt engineering methods in large language models for different nlp tasks,” arXiv preprint arXiv:2407.12994, 2024

[19]

Power hungry processing: Watts driving the cost of ai deployment?

S. Luccioni, Y. Jernite, and E. Strubell, “Power hungry processing: Watts driving the cost of ai deployment?” in Proceedings of the 2024 ACM conference on fairness, accountability, and transparency, 2024, pp. 85–99

[20]

Energy and policy considerations for deep learning in nlp,

E. Strubell, A. Ganesh, and A. McCallum, “Energy and policy considerations for deep learning in nlp,” 01 2019, pp. 3645–3650

[21]

Beyond efficiency: A systematic survey of resource-efficient large language models,

G. Bai, Z. Chai, C. Ling, S. Wang, J. Lu, N. Zhang, T. Shi, Z. Yu, M. Zhu, Y. Zhang et al., “Beyond efficiency: A systematic survey of resource-efficient large language models,” arXiv preprint arXiv:2401.00625, 2024

[22]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-lm: Training multi-billion parameter language models using model parallelism,” arXiv preprint arXiv:1909.08053, 2019

[23]

Efficient large-scale language model training on gpu clusters using megatron-lm,

D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro et al., “Efficient large-scale language model training on gpu clusters using megatron-lm,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–15

[24]

Codecarbon: estimate and track carbon emissions from machine learning computing,

V. Schmidt, K. Goyal, A. Joshi, B. Feld, L. Conell, N. Laskaris, D. Blank, J. Wilson, S. Friedler, and S. Luccioni, “Codecarbon: estimate and track carbon emissions from machine learning computing,” 2021. DOI: https://doi.org/10.5281/zenodo.4658424

[25]

Carbontracker: Tracking and predicting the carbon footprint of training deep learning models,

L. F. W. Anthony, B. Kanding, and R. Selvan, “Carbontracker: Tracking and predicting the carbon footprint of training deep learning models,” arXiv preprint arXiv:2007.03051, 2020

[26]

Towards the systematic reporting of the energy and carbon footprints of machine learning,

P. Henderson, J. Hu, J. Romoff, E. Brunskill, D. Jurafsky, and J. Pineau, “Towards the systematic reporting of the energy and carbon footprints of machine learning,” Journal of Machine Learning Research, vol. 21, no. 248, pp. 1–43, 2020

[27]

Quantifying the Carbon Emissions of Machine Learning

A. Lacoste, A. Luccioni, V. Schmidt, and T. Dandres, “Quantifying the carbon emissions of machine learning,” arXiv preprint arXiv:1910.09700, 2019

[28]

Llmcarbon: Modeling the end-to-end carbon footprint of large language models,

A. Faiz, S. Kaneda, R. Wang, R. Osi, P. Sharma, F. Chen, and L. Jiang, “Llmcarbon: Modeling the end-to-end carbon footprint of large language models,” arXiv preprint arXiv:2309.14393, 2023

[29]

An Analysis of Deep Neural Network Models for Practical Applications

A. Canziani, A. Paszke, and E. Culurciello, “An analysis of deep neural network models for practical applications,” arXiv preprint arXiv:1605.07678, 2016

[30]

Evaluating the energy efficiency of deep convolutional neural networks on cpus and gpus,

D. Li, X. Chen, M. Becchi, and Z. Zong, “Evaluating the energy efficiency of deep convolutional neural networks on cpus and gpus,” in 2016 IEEE international conferences on big data and cloud computing (BDCloud), social computing and networking (SocialCom), sustainable computing and communications (SustainCom) (BDCloud-SocialCom-SustainCom). IEEE, 2016,...

[31]

Trends in ai inference energy consumption: Beyond the performance-vs-parameter laws of deep learning,

R. Desislavov, F. Martínez-Plumed, and J. Hernández-Orallo, “Trends in ai inference energy consumption: Beyond the performance-vs-parameter laws of deep learning,” Sustainable Computing: Informatics and Systems, vol. 38, p. 100857, 2023

[32]

Imagenet large scale visual recognition challenge,

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International journal of computer vision, vol. 115, pp. 211–252, 2015

[33]

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “Glue: A multi-task benchmark and analysis platform for natural language understanding,” arXiv preprint arXiv:1804.07461, 2018.

[34]

From words to watts: Benchmarking the energy costs of large language model inference,

S. Samsi, D. Zhao, J. McDonald, B. Li, A. Michaleas, M. Jones, W. Bergeron, J. Kepner, D. Tiwari, and V. Gadepally, “From words to watts: Benchmarking the energy costs of large language model inference,” in 2023 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2023, pp. 1–9

[35]

Stanford alpaca: An instruction-following llama model,

R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford alpaca: An instruction-following llama model,” 2023

[36]

Training Verifiers to Solve Math Word Problems

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano et al., “Training verifiers to solve math word problems,” arXiv preprint arXiv:2110.14168, 2021

[37]

Mistral 7b,

A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mistral 7b,” 2023

[38]

Gemma: Open models based on gemini research and technology,

G. Team, T. Mesnard, C. Hardin, and R. D. et al., “Gemma: Open models based on gemini research and technology,” 2024

[39]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,

W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing, “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” 2023

[40]

Know what you don’t know: Unanswerable questions for squad,

P. Rajpurkar, R. Jia, and P. Liang, “Know what you don’t know: Unanswerable questions for squad,” CoRR, 2018

[41]

Instruction tuning with gpt-4,

B. Peng, C. Li, P. He, M. Galley, and J. Gao, “Instruction tuning with gpt-4,” 2023

[42]

Webglm: Towards an efficient web-enhanced question answering system with human preferences,

X. Liu, H. Lai, H. Yu, Y. Xu, A. Zeng, Z. Du, P. Zhang, Y. Dong, and J. Tang, “Webglm: Towards an efficient web-enhanced question answering system with human preferences,” 2023. [Online]. Available: https://arxiv.org/abs/2306.07906

[43]

Learning word vectors for sentiment analysis,

A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, “Learning word vectors for sentiment analysis,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, pp. 142–150

[44]

Character-level Convolutional Networks for Text Classification

X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” 2015. [Online]. Available: https://arxiv.org/abs/1509.01626

[45]

Tsatc: Twitter sentiment analysis training corpus,

I. Naji, “Tsatc: Twitter sentiment analysis training corpus,” in thinknook, 2012. [Online]. Available: https://huggingface.co/datasets/carblacac/twitter-sentiment-analysis

[46]

Orca: Progressive learning from complex explanation traces of gpt-4,

S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, H. Palangi, and A. Awadallah, “Orca: Progressive learning from complex explanation traces of gpt-4,” 2023

[47]

Helpsteer: Multi-attribute helpfulness dataset for steerlm,

Z. Wang, Y. Dong, J. Zeng, V. Adams, M. N. Sreedhar, D. Egert, O. Delalleau, J. P. Scowcroft, N. Kant, A. Swope, and O. Kuchaiev, “Helpsteer: Multi-attribute helpfulness dataset for steerlm,” 2023. [Online]. Available: https://arxiv.org/abs/2311.09528

[48]

GPTeacher General-Instruct dataset

“GPTeacher General-Instruct dataset.” [Online]. Available: https://huggingface.co/datasets/teknium/GPTeacher-General-Instruct

[49]

Zeus: Understanding and optimizing GPU energy consumption of DNN training,

J. You, J.-W. Chung, and M. Chowdhury, “Zeus: Understanding and optimizing GPU energy consumption of DNN training,” in 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023, pp. 119–139

[50]

When to stop? towards efficient code generation in llms with excess token prevention,

L. Guo, Y. Wang, E. Shi, W. Zhong, H. Zhang, J. Chen, R. Zhang, Y. Ma, and Z. Zheng, “When to stop? towards efficient code generation in llms with excess token prevention,” in Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 2024, pp. 1073–1085.
