EnergyLens: Interpretable Closed-Form Energy Models for Multimodal LLM Inference Serving

Gianluca Palermo; Michael E. Papka; Vittorio Palladino; Zhiling Lan

arxiv: 2605.10556 · v2 · submitted 2026-05-11 · 💻 cs.CV · cs.LG

Pith reviewed 2026-05-14 21:20 UTC · model grok-4.3

keywords: energy modeling, LLM inference, symbolic regression, configuration optimization, closed-form models, multimodal workloads, parallelism strategies

The pith

Symbolic regression on 50 energy measurements produces a single twelve-parameter closed-form model that selects energy-optimal LLM inference configurations across models and hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

EnergyLens treats symbolic regression as a structure-discovery tool applied to a modest set of profiling runs. It recovers a compact mathematical expression that predicts energy consumption by separating the contributions of tensor parallelism from pipeline parallelism and prefill from decode phases. The resulting formula remains fixed in structure while its coefficients are fitted once, allowing the same model to generalize to new batch sizes, sequence lengths, and accelerator platforms. This approach avoids both latency proxies and data-intensive black-box predictors that lose accuracy when parallelism strategies change.

Core claim

EnergyLens derives a single twelve-parameter closed-form energy model from profiling data that expresses inference energy in terms of degree of parallelism, batch size, and sequence length. The model explicitly decouples tensor and pipeline parallelism contributions and isolates prefill energy from decode energy, yielding predictions that remain interpretable and actionable for configuration selection.
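
To make the shape of such a model concrete, here is a minimal sketch of a twelve-coefficient closed form with separate prefill and decode terms and separate tensor-parallel and pipeline-parallel contributions. The functional form, the `Workload` fields, and the coefficient names are all hypothetical and chosen for illustration; this summary does not state the equation the paper actually recovers.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Workload:
    tp: int          # tensor-parallel degree
    pp: int          # pipeline-parallel degree
    batch: int       # requests per batch
    prompt_len: int  # prompt tokens per request (prefill phase)
    gen_len: int     # generated tokens per request (decode phase)


def _phase_energy(tokens: float, tp: int, pp: int, c: Sequence[float]) -> float:
    """Six-coefficient phase term (hypothetical form, not the paper's)."""
    base, per_token, tp_share, tp_comm, pp_stage, pp_token = c
    return (
        base
        + per_token * tokens            # work that does not shrink with parallelism
        + tp_share * tokens / tp        # work divided across tensor-parallel ranks
        + tp_comm * tokens * (tp - 1)   # tensor-parallel communication overhead
        + pp_stage * (pp - 1)           # fixed cost per extra pipeline stage
        + pp_token * tokens * (pp - 1)  # per-token cost of crossing stage boundaries
    )


def predict_energy(w: Workload, theta: Sequence[float]) -> Tuple[float, float]:
    """Return (prefill_joules, decode_joules) for a 12-coefficient vector."""
    if len(theta) != 12:
        raise ValueError("theta must hold exactly 12 coefficients")
    prefill = _phase_energy(w.batch * w.prompt_len, w.tp, w.pp, theta[:6])
    decode = _phase_energy(w.batch * w.gen_len, w.tp, w.pp, theta[6:])
    return prefill, decode
```

Because the two phases and the two parallelism types each have their own coefficients, a reader can inspect a fitted `theta` and see which term dominates for a given configuration, which is the kind of interpretability the core claim describes.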

What carries the argument

The twelve-parameter closed-form energy model discovered by symbolic regression, which isolates tensor versus pipeline parallelism and prefill versus decode phases to produce physically interpretable energy predictions.

If this is right

The model achieves 88.2 percent top-1 accuracy in selecting energy-optimal configurations using only 50 profiling samples.
It matches the accuracy of ensemble machine-learning methods while requiring roughly ten times fewer measurements.
The same functional form extrapolates to batch sizes and hardware platforms not present in the original profiling set.
Decoupling parallelism types and prefill/decode phases supplies concrete guidance for choosing parallelism strategies that reduce energy without sacrificing throughput.
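
The top-1 figure above can be read as a simple counting metric. The sketch below assumes a definition that this summary does not spell out: a scenario counts as a hit when the configuration with the lowest predicted energy is also the configuration with the lowest measured energy. The function name and the data layout are illustrative assumptions.

```python
from typing import Dict, Hashable, List


def top1_accuracy(scenarios: List[Dict[str, Dict[Hashable, float]]]) -> float:
    """Fraction of scenarios where predicted and measured energy minima coincide.

    Each scenario is a dict with two keys, "predicted" and "measured",
    each mapping a configuration label to energy in joules (assumed layout).
    """
    if not scenarios:
        raise ValueError("need at least one scenario")
    hits = 0
    for s in scenarios:
        predicted_best = min(s["predicted"], key=s["predicted"].get)
        measured_best = min(s["measured"], key=s["measured"].get)
        hits += predicted_best == measured_best
    return hits / len(scenarios)
```

Under this definition the metric rewards correct ranking of the best configuration only; a model can have noticeable absolute error and still score well if it orders the candidates correctly.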

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Deployment systems could embed the model to evaluate candidate configurations at runtime before launching an inference job.
The separation of prefill and decode terms suggests analogous closed-form models could be derived for other resources such as memory bandwidth.
Because the structure is fixed, the approach invites testing whether the same twelve-parameter skeleton applies to emerging state-space model architectures with only coefficient updates.
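
As a rough illustration of the first extension, a launcher could enumerate candidate parallelism layouts and pick the one with the lowest predicted energy before starting a job. The sketch assumes that every candidate uses all GPUs (tensor-parallel degree times pipeline-parallel degree equals the GPU count) and that the caller supplies a `predict` function; both are assumptions for illustration, not details from the paper.

```python
from typing import Callable, List, Tuple


def candidate_layouts(n_gpus: int) -> List[Tuple[int, int]]:
    """All (tp, pp) pairs whose product equals the GPU count."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be positive")
    return [(tp, n_gpus // tp) for tp in range(1, n_gpus + 1) if n_gpus % tp == 0]


def pick_layout(n_gpus: int, predict: Callable[[int, int], float]) -> Tuple[int, int]:
    """Return the (tp, pp) layout with the lowest predicted energy."""
    return min(candidate_layouts(n_gpus), key=lambda layout: predict(*layout))
```

Since a closed-form model is cheap to evaluate, an exhaustive scan over layouts like this one adds negligible overhead compared with the inference job itself.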

Load-bearing premise

A single functional form fitted from limited profiling runs will remain valid and generalize across model architectures, hardware platforms, and multimodal workloads without structural changes or additional parameters.

What would settle it

Observing that the fixed twelve-parameter formula produces large prediction errors on energy measurements from a previously unseen model family or accelerator when the same coefficients are used without refitting.
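
One way to run this check is to freeze the coefficients, predict energy for the unseen model family or accelerator, and compare against measurements. The sketch below uses mean absolute percentage error; the metric choice and the 20 percent tolerance are hypothetical placeholders, since this summary does not say what error level the paper would consider a failure.

```python
from typing import Sequence


def mape(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error between measured and predicted energy."""
    if not measured or len(measured) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    if any(m == 0 for m in measured):
        raise ValueError("measured energy must be non-zero")
    total = sum(abs(m - p) / abs(m) for m, p in zip(measured, predicted))
    return 100.0 * total / len(measured)


def transfers_without_refit(
    measured: Sequence[float],
    predicted: Sequence[float],
    tolerance_pct: float = 20.0,  # hypothetical threshold, not from the paper
) -> bool:
    """True if frozen-coefficient predictions stay within the tolerance."""
    return mape(measured, predicted) <= tolerance_pct
```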

Figures

Figures reproduced from arXiv: 2605.10556 by Gianluca Palermo, Michael E. Papka, Vittorio Palladino, Zhiling Lan.

Figure 1: Overview of the ENERGYLENS training pipeline. The procedure consists of five stages: (1) definition of the input space, comprising configuration parameters, families of models, GPU architectures, and workload types; (2) systematic energy profiling of multimodal LLM serving across diverse models and configurations, producing reusable datasets; (3) symbolic-regression-driven model construction that discovers…

Figure 2: Average latency (seconds) versus TP/PP configuration for four…

Figure 3: Average energy (Joules) versus TP/PP configuration for four model…

Figure 4: Average energy (Joules) versus TP/PP configuration for Mistral-7B

Figure 5: Average per-inference energy (Joules) versus batch size for Qwen

Figure 6: Average energy (Joules) versus TP/PP configuration for Qwen-7B

Figure 7: Average energy (Joules) versus TP/PP configuration for Qwen2-VL-7B

Figure 8: Pairwise accuracy without per-configuration power measurements.

Original abstract

As large language models span dense, mixture-of-experts, and state-space architectures and are deployed on heterogeneous accelerators under increasingly diverse multimodal workloads, optimizing inference energy has become as critical as optimizing latency and throughput. Existing approaches either treat latency as an energy proxy or rely on data-hungry black-box surrogates. Both fail under varying parallelism strategies: latency and energy optima diverge in over 20% of configurations we tested, and black-box surrogates require hundreds of profiling samples to generalize across model families and hardware. We present EnergyLens, which uses symbolic regression as a structure-discovery tool over profiling data to derive a single twelve-parameter closed-form energy model expressed in terms of system properties such as degree of parallelism, batch size, and sequence length. Unlike black-box surrogates, EnergyLens decouples tensor and pipeline parallelism contributions and separates prefill from decode energy, making its predictions physically interpretable and actionable. Fitted from as few as 50 profiling measurements, EnergyLens achieves 88.2% Top-1 configuration selection accuracy across many evaluation scenarios compared to 60.9% for the closest prior analytical baseline, matches the predictive accuracy of ensemble ML methods with 10x fewer profiling samples, and extrapolates reliably to unseen batch sizes and hardware platforms without structural modification, making it a practical, interpretable tool for energy-optimal LLM deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EnergyLens finds a 12-parameter closed-form energy model via symbolic regression that splits prefill/decode and tensor/pipeline terms, with a reported 88.2% top-1 selection accuracy from 50 samples.

The core advance is using symbolic regression on a small set of profiling runs to produce one fixed algebraic form that explicitly separates prefill energy from decode energy and tensor parallelism from pipeline parallelism. That decomposition is what lets the model stay interpretable and still hit 88.2% top-1 configuration accuracy while using roughly 10x fewer samples than black-box ensembles. The extrapolation claim to new batch sizes and hardware without changing the equation skeleton is the part that would matter most in practice if it holds up.

Referee Report

3 major / 2 minor

Summary. The manuscript presents EnergyLens, which applies symbolic regression to profiling data to obtain a single twelve-parameter closed-form energy model for LLM inference. The model decouples tensor and pipeline parallelism and separates prefill and decode phases. It reports achieving 88.2% top-1 accuracy in selecting energy-optimal configurations using as few as 50 samples, outperforming prior analytical methods (60.9%) and matching ensemble ML methods with 10x fewer samples, while extrapolating to unseen batch sizes and hardware platforms without changing the model structure.

Significance. If validated, the work would offer an interpretable and sample-efficient alternative to black-box energy models for optimizing LLM serving on heterogeneous hardware, potentially aiding practical deployment decisions where energy is a key constraint. The closed-form nature and physical interpretability are strengths compared to opaque ML surrogates.

major comments (3)

Abstract: The abstract states concrete accuracy figures (88.2% Top-1, 60.9% baseline) and sample counts (50 profiling measurements) but provides no information on measurement methodology, statistical significance, cross-validation procedure, or how extrapolation was quantified; without these the central performance claims cannot be verified.
§4 (Evaluation): The twelve parameters are obtained by fitting to the same profiling data used for evaluation, so any reported accuracy or extrapolation performance is necessarily dependent on those fitted values rather than an independent derivation; the symbolic-regression step discovers structure but does not remove the circularity.
§3.2 (Symbolic Regression): The central claim requires that symbolic regression produces one fixed functional skeleton whose algebraic structure remains unchanged across model architectures (dense, MoE, state-space), hardware platforms, and multimodal workloads. No evidence is provided that the identical equation skeleton was recovered or imposed across the reported evaluation scenarios; if sensitive to the initial profiling distribution, refitting coefficients alone would fail to recover the claimed 88.2% accuracy and 10× sample-efficiency.

minor comments (2)

Notation throughout: The explicit decoupling of tensor/pipeline parallelism and prefill/decode contributions is described conceptually but the precise algebraic terms for each (e.g., how batch size and sequence length enter each component) could be stated with numbered equations for clarity.
Figures 3-5: Axis labels and legends should explicitly note the number of profiling runs and whether the plotted points are in-sample or held-out to support the extrapolation claims.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps improve the clarity and rigor of our claims regarding EnergyLens. We address each major comment below and will incorporate revisions to address concerns about verifiability and methodological details.

Referee: Abstract: The abstract states concrete accuracy figures (88.2% Top-1, 60.9% baseline) and sample counts (50 profiling measurements) but provides no information on measurement methodology, statistical significance, cross-validation procedure, or how extrapolation was quantified; without these the central performance claims cannot be verified.

Authors: We agree the abstract is overly concise and will revise it to briefly note the measurement methodology (power metering on heterogeneous GPUs during profiling runs), indicate that results are from 5-fold cross-validation, and state that extrapolation tests use held-out batch sizes and platforms. Full statistical details and procedures remain in §4, but this addition will make the abstract self-contained for the key claims. revision: yes
Referee: §4 (Evaluation): The twelve parameters are obtained by fitting to the same profiling data used for evaluation, so any reported accuracy or extrapolation performance is necessarily dependent on those fitted values rather than an independent derivation; the symbolic-regression step discovers structure but does not remove the circularity.

Authors: This is a valid observation about potential circularity in the current presentation. We will revise §4 to explicitly describe our protocol: symbolic regression and coefficient fitting are performed on training folds (80% of the 50 samples), with top-1 accuracy and extrapolation metrics computed on held-out test folds and entirely separate profiling runs for unseen batch sizes/hardware. We will report per-fold results and confirm no data leakage in the extrapolation experiments. revision: yes
Referee: §3.2 (Symbolic Regression): The central claim requires that symbolic regression produces one fixed functional skeleton whose algebraic structure remains unchanged across model architectures (dense, MoE, state-space), hardware platforms, and multimodal workloads. No evidence is provided that the identical equation skeleton was recovered or imposed across the reported evaluation scenarios; if sensitive to the initial profiling distribution, refitting coefficients alone would fail to recover the claimed 88.2% accuracy and 10× sample-efficiency.

Authors: We will revise §3.2 and add a new table in the main text (or appendix) showing the functional forms recovered when symbolic regression is run independently on data subsets for dense, MoE, and state-space models across hardware platforms. The identical twelve-parameter skeleton emerged consistently without being imposed a priori. We will also include a brief sensitivity analysis to initial profiling distributions to demonstrate robustness of the structure. revision: yes
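
The protocol described in the second response (fit on 80% of the 50 samples, score on the held-out 20%) corresponds to ordinary 5-fold cross-validation. A minimal sketch of the index split follows, using only the standard library; the function name, the seed, and the shuffling strategy are assumptions for illustration, not the authors' code.

```python
import random
from typing import Iterator, List, Tuple


def kfold_indices(
    n_samples: int, k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding assigns every index to exactly one fold.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [j for i, fold in enumerate(folds) if i != held_out for j in fold]
        yield train, test
```

With 50 samples and five folds, each split fits on 40 measurements and scores on 10. To avoid the circularity the referee raises, both the structure search and the coefficient fit would need to run inside each training fold, with the test fold touched only for scoring.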

Circularity Check

1 step flagged

12-parameter model fitted to profiling data; accuracy and extrapolation claims reduce to fit quality on same data

specific steps

fitted input called prediction [Abstract]
"Fitted from as few as 50 profiling measurements, EnergyLens achieves 88.2% Top-1 configuration selection accuracy across many evaluation scenarios compared to 60.9% for the closest prior analytical baseline, matches the predictive accuracy of ensemble ML methods with 10x fewer profiling samples, and extrapolates reliably to unseen batch sizes and hardware platforms without structural modification."

The twelve parameters (and the functional skeleton) are obtained by fitting to the profiling measurements; the accuracy, sample-efficiency, and extrapolation figures are measured on configurations drawn from the same or closely related profiling runs, so the performance numbers are a direct consequence of how well the fitted model reproduces its own input data rather than an independent test of a derivation.

full rationale

The derivation chain begins with symbolic regression applied to ~50 profiling runs to discover a 12-parameter functional form, followed by coefficient fitting on those same measurements. The headline claims of 88.2% Top-1 accuracy, 10x sample efficiency versus ensembles, and reliable extrapolation to unseen batches/hardware are then asserted after this fitting step. Because the evaluation scenarios draw from the identical profiling distribution used to obtain both structure and coefficients, the reported performance metrics are statistically dependent on the input data rather than arising from an independent first-principles derivation. This matches the fitted-input-called-prediction pattern; no load-bearing self-citation or ansatz smuggling is required to reach the reduction. The result is partial circularity: the model is useful as a compact surrogate but its claimed predictive power is not independent of the fitting data.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical discovery of a twelve-parameter functional form from limited profiling data; the parameters themselves are free and fitted, while the assumption that a single fixed structure suffices across models and hardware is an unproven domain assumption.

free parameters (1)

twelve model parameters
Obtained by fitting symbolic regression to 50 profiling measurements; these coefficients encode the discovered energy contributions of parallelism, batch size, and sequence length.

axioms (1)

domain assumption: Energy consumption can be additively decomposed into prefill and decode phases whose contributions are independent of each other once parallelism settings are fixed.
Invoked by the claim that the model separates prefill from decode energy.
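
Written out, the assumption amounts to the following decomposition. The symbols are introduced here for illustration and are not taken from the paper: $p$ stands for the fixed parallelism setting (tensor and pipeline degrees) and $w$ for the workload (batch size and sequence lengths).

```latex
E_{\text{total}}(p, w) \;=\; E_{\text{prefill}}(p, w) \;+\; E_{\text{decode}}(p, w),
\qquad p = (\mathrm{TP}, \mathrm{PP})
```

The absence of any cross term between the two phases is what the axiom asserts; a measured interaction between prefill and decode energy at fixed $p$ would violate it.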
