Energy-Aware LLMs: A step towards sustainable AI for downstream applications

Brigitte Jaumard; Nguyen Phuc Tran; Oscar Delgado

arXiv: 2503.17783 · v1 · submitted 2025-03-22 · cs.PF · cs.AI · cs.CL · cs.LG

Pith reviewed 2026-05-22 23:13 UTC · model grok-4.3

keywords: energy efficiency, quantization, pruning, large language models, fault ticket analysis, communication networks, model compression, sustainable computing

The pith

An appropriate combination of quantization and pruning reduces energy consumption in LLMs while improving performance on fault analysis tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops an end-to-end pipeline to examine how quantization and pruning affect both the energy use and accuracy of large language models applied to fault ticket analysis in communication networks. The pipeline is tested using two real-world datasets on the tasks of root cause analysis and response feedback. Results indicate that suitable levels of these techniques lower energy needs and raise model performance at the same time. Readers might care because high energy demands currently limit the practical deployment of advanced AI in resource-sensitive domains like network management.

Core claim

The paper establishes that an appropriate combination of quantization and pruning techniques is able to reduce energy consumption while significantly improving model performance for an LLM during fault ticket analysis in communication networks, as shown through evaluation on two real-world datasets for root cause analysis and response feedback.

What carries the argument

An end-to-end pipeline that applies quantization and pruning to an LLM and measures the resulting energy-performance trade-off on fault ticket datasets.
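
As a rough illustration of the two compression steps such a pipeline combines, the sketch below applies unstructured magnitude pruning followed by symmetric per-tensor quantization to a toy weight list. It is a sketch under stated assumptions, not the paper's implementation: the function names (`magnitude_prune`, `quantize_symmetric`, `dequantize`), the magnitude criterion, the single per-tensor scale, and the sample weights are all hypothetical choices made here.

```python
from typing import List, Tuple


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the smallest-magnitude weights (unstructured pruning)."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_symmetric(weights: List[float], bits: int) -> Tuple[List[int], float]:
    """Map floats to signed integers using one scale for the whole tensor."""
    if bits < 2:
        raise ValueError("bits must be at least 2")
    qmax = 2 ** (bits - 1) - 1
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from integer levels."""
    return [v * scale for v in q]


# Hypothetical toy weights, not taken from any model in the paper.
weights = [0.8, -0.05, 0.3, -1.2, 0.01, 0.6]
pruned = magnitude_prune(weights, sparsity=0.5)
q, scale = quantize_symmetric(pruned, bits=8)
restored = dequantize(q, scale)
```

With these toy values, half of the weights are set to zero before quantization, and each restored weight differs from its pruned value by at most half a quantization step. Real LLM pipelines typically prune and quantize per layer or per channel, which this sketch does not attempt.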

If this is right

Lower energy consumption for LLM-based applications in communication networks.
Enhanced performance in root cause analysis and response feedback tasks.
Feasibility of sustainable AI deployment for downstream tasks without sacrificing accuracy.
Trade-off management through targeted model compression methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar combinations could extend to other high-energy AI applications outside of networks.
Future work might test these techniques on larger models or different datasets to confirm generalizability.
Integration with hardware-specific optimizations could yield additional efficiency gains.

Load-bearing premise

The selected quantization and pruning levels on the chosen LLM and the two fault-ticket datasets yield genuine performance gains rather than results tied to particular metrics or data choices.

What would settle it

Re-evaluating the pipeline using alternative performance metrics or including standard baseline LLMs without quantization and pruning to determine if the reported improvements hold.
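
One way to frame the baseline comparison is to compute energy for each configuration from sampled power readings and report the change relative to the uncompressed model. The sketch below assumes power is sampled in watts at known timestamps and integrates it with the trapezoidal rule; the function names (`energy_joules`, `relative_change`) and all sample values are placeholders introduced here, not measurements or tooling from the paper.

```python
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate sampled power (watts) over time (seconds), trapezoidal rule."""
    if len(timestamps_s) != len(power_w) or len(power_w) < 2:
        raise ValueError("need at least two matching samples")
    total = 0.0
    for i in range(1, len(power_w)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def relative_change(baseline: float, value: float) -> float:
    """Signed fractional change versus the baseline (negative means lower)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline


# Placeholder power traces, not results from the paper.
t = [0.0, 1.0, 2.0, 3.0]
baseline_power = [200.0, 220.0, 210.0, 200.0]
compressed_power = [150.0, 160.0, 155.0, 150.0]

e_base = energy_joules(t, baseline_power)
e_comp = energy_joules(t, compressed_power)
saving = -relative_change(e_base, e_comp)
print(f"baseline {e_base:.1f} J, compressed {e_comp:.1f} J, saving {saving:.1%}")
```

The same `relative_change` helper can be applied to each quality metric, so that a configuration counts as a joint improvement only when energy falls and the metric does not.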

Figures

Figures reproduced from arXiv: 2503.17783 by Brigitte Jaumard, Nguyen Phuc Tran, Oscar Delgado.

**Figure 2.** Energy-Performance pipeline evaluation for LLMs.

**Figure 3.** Energy consumption vs. training loss on each epoch.

**Figure 5.** Effects of quantization levels during the inference phase. Similar to the fine-tuning phase, the 16-bit model stands out as one of the top candidates for energy efficiency (reduction of up to 40.5% overall), offering a good compromise with model performance. In detail, the BERTScore drops slightly while other metrics show a slight improvement. The 8-bit model demonstrates an improvement across…

**Figure 6.** LLAMA3: impact of unstructured-base pruning and …

Original abstract

Advanced Large Language Models (LLMs) have revolutionized various fields, including communication networks, sparking an innovation wave that has led to new applications and services, and significantly enhanced solution schemes. Despite all these impressive developments, most LLMs typically require huge computational resources, resulting in terribly high energy consumption. Thus, this research study proposes an end-to-end pipeline that investigates the trade-off between energy efficiency and model performance for an LLM during fault ticket analysis in communication networks. It further evaluates the pipeline performance using two real-world datasets for the tasks of root cause analysis and response feedback in a communication network. Our results show that an appropriate combination of quantization and pruning techniques is able to reduce energy consumption while significantly improving model performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes an end-to-end pipeline investigating the trade-off between energy efficiency and performance for LLMs applied to fault ticket analysis in communication networks. It evaluates quantization and pruning on two real-world datasets for root cause analysis and response feedback tasks, claiming that an appropriate combination of these techniques reduces energy consumption while significantly improving model performance.

Significance. If the empirical gains are robustly demonstrated, the result would be significant because it would show simultaneous energy reduction and performance improvement, contrary to the typical accuracy-efficiency trade-off in model compression. The use of real-world communication network datasets provides practical grounding for sustainable AI in downstream applications.

major comments (1)

[Abstract] The central claim that 'an appropriate combination of quantization and pruning techniques is able to reduce energy consumption while significantly improving model performance' is unsupported by any reported metrics, baseline comparisons against the unoptimized LLM, ablation results on quantization/pruning levels, error bars, or statistical tests. This is load-bearing for the headline result because quantization and pruning normally degrade accuracy, so the reported improvement requires explicit controls to rule out metric or data artifacts.

minor comments (1)

[Abstract] The two datasets and the specific LLM are referred to only generically; naming them and providing basic statistics (size, class balance) would improve clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for stronger evidentiary support for the central claim in the abstract. We address this point below and commit to revisions that will make the empirical results more transparent and robust.

Referee: [Abstract] The central claim that 'an appropriate combination of quantization and pruning techniques is able to reduce energy consumption while significantly improving model performance' is unsupported by any reported metrics, baseline comparisons against the unoptimized LLM, ablation results on quantization/pruning levels, error bars, or statistical tests. This is load-bearing for the headline result because quantization and pruning normally degrade accuracy, so the reported improvement requires explicit controls to rule out metric or data artifacts.

Authors: We agree that the headline claim of simultaneous energy reduction and performance improvement is counter to the usual compression trade-off and therefore requires explicit controls. In the revised manuscript we will add: (1) direct baseline comparisons against the unoptimized full-precision LLM on both datasets and tasks, (2) ablation tables showing performance and energy at multiple quantization bit-widths and pruning ratios, (3) error bars derived from at least three independent runs with different random seeds, and (4) statistical significance tests (paired t-tests or Wilcoxon signed-rank) on the observed gains. These additions will be placed in the results section and referenced from a revised abstract so that the claim is no longer unsupported.

Revision: yes
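
The error bars and paired test named in the rebuttal can be sketched with the Python standard library alone. The example below computes a mean and sample standard deviation per configuration and the paired t statistic across seeds. The helper names (`mean_and_std`, `paired_t_statistic`) and the per-seed scores are hypothetical placeholders, not results from the paper; turning the statistic into a p-value needs the t distribution, which a library such as SciPy provides through `scipy.stats.ttest_rel`.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_std(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation, usable as an error bar."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for the paired differences b - a."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [y - x for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder scores for three seeds, not results from the paper.
baseline_scores = [0.70, 0.72, 0.71]
compressed_scores = [0.73, 0.76, 0.72]

base_mean, base_std = mean_and_std(baseline_scores)
comp_mean, comp_std = mean_and_std(compressed_scores)
t_stat, dof = paired_t_statistic(baseline_scores, compressed_scores)
print(f"baseline {base_mean:.3f} ± {base_std:.3f}")
print(f"compressed {comp_mean:.3f} ± {comp_std:.3f}")
print(f"paired t = {t_stat:.3f} with {dof} degrees of freedom")
```

With only three seeds the test has two degrees of freedom, so even a consistent gain may not reach conventional significance thresholds; more runs tighten both the error bars and the test.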

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivation chain

The paper presents an end-to-end empirical pipeline evaluating quantization and pruning on LLMs for fault-ticket tasks using two real-world datasets. No mathematical derivation, equations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the provided abstract or description. Results are framed as direct measurements of energy and performance metrics rather than outputs derived from prior fitted values or self-referential definitions. The central claim rests on experimental outcomes, which are independently falsifiable via replication on the datasets and thus do not reduce to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on unstated assumptions about dataset representativeness, metric validity, and the absence of post-hoc selection of quantization/pruning levels.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 1 internal anchor

[1] S. Soman and R. HG. Observations on LLMs for telecom domain: capabilities and limitations. In Proceedings of the 3rd Int. Conference on AI-ML Systems, pp. 1–5, 2023.
[2] Y. Chen et al. Automatic root cause analysis via large language models for cloud incidents. In Proceedings of the 19th European Conference on Computer Systems, pp. 674–688, 2024.
[3] S. Roychowdhury et al. Unlocking telecom domain knowledge using LLMs. In 16th Int. Conference on COMmunication Systems & NETworkS (COMSNETS), pp. 267–269. IEEE, 2024.
[4] P. Patel et al. Characterizing power management opportunities for LLMs in the cloud. In Proceedings of the 29th ACM Int. Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pp. 207–222, 2024.
[5] A. Maatouk et al. Large language models for telecom: Forthcoming impact on the industry. IEEE Communications Magazine, 2024.
[6] A. H. Zadeh et al. GOBO: Quantizing attention-based NLP models for low latency and energy efficient inference. In 53rd Annual IEEE/ACM Int. Symposium on Microarchitecture (MICRO), pp. 811–824, 2020.
[7] E. J. Hu et al. LoRA: Low-rank adaptation of large language models. In Int. Conference on Learning Representations, 2022.
[8] W. Li et al. Quantization and hardware architecture co-design for matrix-vector multiplications of large language models. IEEE Transactions on Circuits and Systems I: Regular Papers, 71(6):2858–2871, 2024.
[9] S. Anwar et al. Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems (JETC), 13(3):1–18, 2017.
[10] Y. He and L. Xiao. Structured pruning for deep convolutional neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[11] E. Frantar and D. Alistarh. SparseGPT: Massive language models can be accurately pruned in one-shot. In Int. Conference on Machine Learning, pp. 10323–10337. PMLR, 2023.
[12] A. Filighera et al. Your answer is incorrect... would you like to know why? Introducing a bilingual short answer feedback dataset. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1), pp. 8577–8591, 2022.
[13] H. Xia et al. Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs. In USENIX Annual Technical Conference (USENIX ATC 24), pp. 699–713, 2024.
[14] M. Sun et al. A simple and effective pruning approach for large language models. In Workshop on Efficient Systems for Foundation Models @ ICML2023, 2023.
[15] N. Bannour et al. Evaluating the carbon footprint of NLP methods: a survey and analysis of existing tools. In 2nd Workshop on Simple and Efficient Natural Language Processing, pp. 11–21, 2021.
[16] T. Zhang et al. BERTScore: Evaluating Text Generation with BERT. In Int. Conference on Learning Representations (ICLR), 2020.
[17] S. Banerjee and A. Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72, 2005.
[18] K. Papineni et al. Bleu: a method for automatic evaluation of machine translation. In 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318, 2002.
[19] C.-Y. Lin. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pp. 74–81. ACL, 2004.
[20] A. Dubey et al. The Llama 3 herd of models. Meta, 2024.
[21] T. Mesnard et al. Gemma: Open models based on Gemini research and technology. Google DeepMind, abs/2403.08295, 2024.
[22] Y. Zhao et al. Atom: Low-bit quantization for efficient and accurate LLM serving. Proceedings of Machine Learning and Systems, 6:196–209, 2024.
