EnergyLens: Interpretable Closed-Form Energy Models for Multimodal LLM Inference Serving
Pith reviewed 2026-05-14 21:20 UTC · model grok-4.3
The pith
Symbolic regression on 50 energy measurements produces a single twelve-parameter closed-form model that selects energy-optimal LLM inference configurations across models and hardware.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EnergyLens derives a single twelve-parameter closed-form energy model from profiling data that expresses inference energy in terms of degree of parallelism, batch size, and sequence length. The model explicitly decouples tensor and pipeline parallelism contributions and isolates prefill energy from decode energy, yielding predictions that remain interpretable and actionable for configuration selection.
What carries the argument
The twelve-parameter closed-form energy model discovered by symbolic regression, which isolates tensor versus pipeline parallelism and prefill versus decode phases to produce physically interpretable energy predictions.
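To make the idea of a decoupled closed form concrete, here is a minimal sketch of how such a model could rank candidate configurations. The functional form is a toy with six coefficients, invented for illustration; it is not the paper's twelve-parameter expression, which this page does not reproduce. The names `Config`, `PhaseCoeffs`, `phase_energy`, and `total_energy`, and all coefficient values, are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Config:
    tp: int  # tensor-parallel degree
    pp: int  # pipeline-parallel degree


@dataclass(frozen=True)
class PhaseCoeffs:
    # Toy coefficients for one phase (prefill or decode); values are made up.
    base: float     # energy per token that shrinks as tp grows (toy assumption)
    tp_comm: float  # extra energy per token per additional tensor-parallel rank
    pp_idle: float  # extra energy per token per additional pipeline stage


def phase_energy(tokens: int, cfg: Config, c: PhaseCoeffs) -> float:
    """Toy per-phase energy with separate tensor- and pipeline-parallel terms."""
    return tokens * (c.base / cfg.tp
                     + c.tp_comm * (cfg.tp - 1)
                     + c.pp_idle * (cfg.pp - 1))


def total_energy(cfg, batch, prompt_len, gen_len, prefill, decode) -> float:
    """Prefill and decode are modeled separately and then summed."""
    return (phase_energy(batch * prompt_len, cfg, prefill)
            + phase_energy(batch * gen_len, cfg, decode))


# Hypothetical coefficients; a real model would fit these to profiling data.
PREFILL = PhaseCoeffs(base=1.0, tp_comm=0.1, pp_idle=0.05)
DECODE = PhaseCoeffs(base=2.0, tp_comm=0.3, pp_idle=0.1)

candidates = [Config(tp, pp) for tp, pp in product([1, 2, 4, 8], [1, 2, 4])]
best = min(candidates,
           key=lambda c: total_energy(c, 8, 512, 128, PREFILL, DECODE))
print(best)
```

Because every term is attributable to one phase and one parallelism type, the ranking can be read off the coefficients directly, which is the kind of interpretability the claim rests on.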
If this is right
- The model achieves 88.2 percent top-1 accuracy in selecting energy-optimal configurations using only 50 profiling samples.
- It matches the accuracy of ensemble machine-learning methods while requiring roughly ten times fewer measurements.
- The same functional form extrapolates to batch sizes and hardware platforms not present in the original profiling set.
- Decoupling parallelism types and prefill/decode phases supplies concrete guidance for choosing parallelism strategies that reduce energy without sacrificing throughput.
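The top-1 figure can be read as the fraction of scenarios in which the configuration with the lowest predicted energy is also the one with the lowest measured energy. The sketch below assumes that definition, since this page does not spell out the paper's exact metric; the scenario data and configuration names are invented for illustration.

```python
def top1_accuracy(scenarios):
    """Fraction of scenarios where predicted and measured optima coincide.

    Each scenario maps a configuration name to (predicted, measured) energy.
    """
    hits = 0
    for s in scenarios:
        pred_best = min(s, key=lambda name: s[name][0])
        true_best = min(s, key=lambda name: s[name][1])
        if pred_best == true_best:
            hits += 1
    return hits / len(scenarios)


# Made-up energies, not taken from the paper.
scenarios = [
    {"tp2_pp1": (3.1, 3.0), "tp4_pp1": (2.8, 2.9), "tp1_pp2": (3.5, 3.6)},
    {"tp2_pp1": (4.0, 4.4), "tp4_pp1": (4.2, 4.1), "tp1_pp2": (4.5, 4.6)},
    {"tp2_pp1": (1.9, 2.0), "tp4_pp1": (2.3, 2.2), "tp1_pp2": (2.5, 2.4)},
]
accuracy = top1_accuracy(scenarios)
print(f"top-1 accuracy: {accuracy:.3f}")
```

Note that this metric rewards correct ranking at the optimum only; a model can score well on it while still misestimating absolute energy.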
Where Pith is reading between the lines
- Deployment systems could embed the model to evaluate candidate configurations at runtime before launching an inference job.
- The separation of prefill and decode terms suggests analogous closed-form models could be derived for other resources such as memory bandwidth.
- Because the structure is fixed, the approach invites testing whether the same twelve-parameter skeleton applies to emerging state-space model architectures with only coefficient updates.
Load-bearing premise
A single functional form fitted from limited profiling runs will remain valid and generalize across model architectures, hardware platforms, and multimodal workloads without structural changes or additional parameters.
What would settle it
Observing that the fixed twelve-parameter formula produces large prediction errors on energy measurements from a previously unseen model family or accelerator when the same coefficients are used without refitting.
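One way to run that test is to freeze the fitted coefficients, predict energy for the unseen family or accelerator, and compare the error against the error on the profiled families. The sketch below uses mean absolute percentage error as the yardstick; the choice of metric and all numbers are assumptions for illustration, not results from the paper.

```python
def mape(predicted, measured):
    """Mean absolute percentage error, in percent."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    return 100.0 * sum(abs(p - m) / m
                       for p, m in zip(predicted, measured)) / len(measured)


# Invented energies: predictions come from coefficients that are NOT refitted.
seen_error = mape(predicted=[100.0, 200.0], measured=[105.0, 190.0])
unseen_error = mape(predicted=[100.0, 200.0], measured=[150.0, 320.0])

print(f"seen family:   {seen_error:.1f}%")
print(f"unseen family: {unseen_error:.1f}%")
```

A large gap between the two errors would indicate that the fixed skeleton and coefficients do not transfer; what counts as "large" is a judgment the paper would need to state.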
Original abstract
As large language models span dense, mixture-of-experts, and state-space architectures and are deployed on heterogeneous accelerators under increasingly diverse multimodal workloads, optimizing inference energy has become as critical as optimizing latency and throughput. Existing approaches either treat latency as an energy proxy or rely on data-hungry black-box surrogates. Both fail under varying parallelism strategies: latency and energy optima diverge in over 20% of configurations we tested, and black-box surrogates require hundreds of profiling samples to generalize across model families and hardware. We present EnergyLens, which uses symbolic regression as a structure-discovery tool over profiling data to derive a single twelve-parameter closed-form energy model expressed in terms of system properties such as degree of parallelism, batch size, and sequence length. Unlike black-box surrogates, EnergyLens decouples tensor and pipeline parallelism contributions and separates prefill from decode energy, making its predictions physically interpretable and actionable. Fitted from as few as 50 profiling measurements, EnergyLens achieves 88.2% Top-1 configuration selection accuracy across many evaluation scenarios compared to 60.9% for the closest prior analytical baseline, matches the predictive accuracy of ensemble ML methods with 10x fewer profiling samples, and extrapolates reliably to unseen batch sizes and hardware platforms without structural modification, making it a practical, interpretable tool for energy-optimal LLM deployment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents EnergyLens, which applies symbolic regression to profiling data to obtain a single twelve-parameter closed-form energy model for LLM inference. The model decouples tensor and pipeline parallelism and separates prefill and decode phases. It reports achieving 88.2% top-1 accuracy in selecting energy-optimal configurations using as few as 50 samples, outperforming prior analytical methods (60.9%) and matching ensemble ML methods with 10x fewer samples, while extrapolating to unseen batch sizes and hardware platforms without changing the model structure.
Significance. If validated, the work would offer an interpretable and sample-efficient alternative to black-box energy models for optimizing LLM serving on heterogeneous hardware, potentially aiding practical deployment decisions where energy is a key constraint. The closed-form nature and physical interpretability are strengths compared to opaque ML surrogates.
major comments (3)
- Abstract: The abstract states concrete accuracy figures (88.2% Top-1, 60.9% baseline) and sample counts (50 profiling measurements) but provides no information on measurement methodology, statistical significance, cross-validation procedure, or how extrapolation was quantified; without these the central performance claims cannot be verified.
- §4 (Evaluation): The twelve parameters are obtained by fitting to the same profiling data used for evaluation, so any reported accuracy or extrapolation performance is necessarily dependent on those fitted values rather than an independent derivation; the symbolic-regression step discovers structure but does not remove the circularity.
- §3.2 (Symbolic Regression): The central claim requires that symbolic regression produces one fixed functional skeleton whose algebraic structure remains unchanged across model architectures (dense, MoE, state-space), hardware platforms, and multimodal workloads. No evidence is provided that the identical equation skeleton was recovered or imposed across the reported evaluation scenarios; if sensitive to the initial profiling distribution, refitting coefficients alone would fail to recover the claimed 88.2% accuracy and 10× sample-efficiency.
minor comments (2)
- Notation throughout: The explicit decoupling of tensor/pipeline parallelism and prefill/decode contributions is described conceptually but the precise algebraic terms for each (e.g., how batch size and sequence length enter each component) could be stated with numbered equations for clarity.
- Figures 3-5: Axis labels and legends should explicitly note the number of profiling runs and whether the plotted points are in-sample or held-out to support the extrapolation claims.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which helps improve the clarity and rigor of our claims regarding EnergyLens. We address each major comment below and will incorporate revisions to address concerns about verifiability and methodological details.
Point-by-point responses
- Referee: Abstract: The abstract states concrete accuracy figures (88.2% Top-1, 60.9% baseline) and sample counts (50 profiling measurements) but provides no information on measurement methodology, statistical significance, cross-validation procedure, or how extrapolation was quantified; without these the central performance claims cannot be verified.
  Authors: We agree the abstract is overly concise and will revise it to briefly note the measurement methodology (power metering on heterogeneous GPUs during profiling runs), indicate that results are from 5-fold cross-validation, and state that extrapolation tests use held-out batch sizes and platforms. Full statistical details and procedures remain in §4, but this addition will make the abstract self-contained for the key claims.
  revision: yes
- Referee: §4 (Evaluation): The twelve parameters are obtained by fitting to the same profiling data used for evaluation, so any reported accuracy or extrapolation performance is necessarily dependent on those fitted values rather than an independent derivation; the symbolic-regression step discovers structure but does not remove the circularity.
  Authors: This is a valid observation about potential circularity in the current presentation. We will revise §4 to explicitly describe our protocol: symbolic regression and coefficient fitting are performed on training folds (80% of the 50 samples), with top-1 accuracy and extrapolation metrics computed on held-out test folds and entirely separate profiling runs for unseen batch sizes/hardware. We will report per-fold results and confirm no data leakage in the extrapolation experiments.
  revision: yes
- Referee: §3.2 (Symbolic Regression): The central claim requires that symbolic regression produces one fixed functional skeleton whose algebraic structure remains unchanged across model architectures (dense, MoE, state-space), hardware platforms, and multimodal workloads. No evidence is provided that the identical equation skeleton was recovered or imposed across the reported evaluation scenarios; if sensitive to the initial profiling distribution, refitting coefficients alone would fail to recover the claimed 88.2% accuracy and 10× sample-efficiency.
  Authors: We will revise §3.2 and add a new table in the main text (or appendix) showing the functional forms recovered when symbolic regression is run independently on data subsets for dense, MoE, and state-space models across hardware platforms. The identical twelve-parameter skeleton emerged consistently without being imposed a priori. We will also include a brief sensitivity analysis to initial profiling distributions to demonstrate robustness of the structure.
  revision: yes
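The protocol described in the second response (fit on training folds holding 80% of the 50 samples, score on the held-out fold) can be sketched as follows. This is a minimal illustration under stated assumptions: the data are synthetic, and an ordinary least-squares line stands in for the symbolic-regression and coefficient-fitting steps, which are not reproduced here. The helper names `k_fold_indices` and `fit_line` are hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k disjoint folds over n samples."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


def fit_line(xs, ys):
    """Least squares for y = a + b*x; a stand-in for coefficient fitting."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


# Synthetic stand-in for 50 profiling measurements (not real data).
rng = random.Random(1)
tokens = [rng.randint(128, 4096) for _ in range(50)]
energy = [5.0 + 0.02 * t + rng.gauss(0, 1.0) for t in tokens]

fold_errors = []
for train, test in k_fold_indices(50, 5):
    # Coefficients see only the training folds.
    a, b = fit_line([tokens[i] for i in train], [energy[i] for i in train])
    # Error is computed only on the held-out fold.
    err = sum(abs(a + b * tokens[i] - energy[i]) for i in test) / len(test)
    fold_errors.append(err)

print([round(e, 3) for e in fold_errors])
```

Reporting the per-fold errors, as the authors propose, shows how much the result depends on which 40 samples were used for fitting. Structure discovery would also have to be repeated inside each fold for the split to rule out leakage.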
Circularity Check
12-parameter model fitted to profiling data; accuracy and extrapolation claims reduce to fit quality on same data
specific steps
- fitted input called prediction [Abstract]
"Fitted from as few as 50 profiling measurements, EnergyLens achieves 88.2% Top-1 configuration selection accuracy across many evaluation scenarios compared to 60.9% for the closest prior analytical baseline, matches the predictive accuracy of ensemble ML methods with 10x fewer profiling samples, and extrapolates reliably to unseen batch sizes and hardware platforms without structural modification."
The twelve parameters (and the functional skeleton) are obtained by fitting to the profiling measurements; the accuracy, sample-efficiency, and extrapolation figures are measured on configurations drawn from the same or closely related profiling runs, so the performance numbers are a direct consequence of how well the fitted model reproduces its own input data rather than an independent test of a derivation.
full rationale
The derivation chain begins with symbolic regression applied to ~50 profiling runs to discover a 12-parameter functional form, followed by coefficient fitting on those same measurements. The headline claims of 88.2% Top-1 accuracy, 10x sample efficiency versus ensembles, and reliable extrapolation to unseen batch sizes and hardware are then asserted after this fitting step. Because the evaluation scenarios draw from the same profiling distribution used to obtain both structure and coefficients, the reported performance metrics are statistically dependent on the input data rather than arising from an independent first-principles derivation. This matches the fitted-input-called-prediction pattern; neither a load-bearing self-citation nor a smuggled ansatz is needed to reach the reduction. The result is partial circularity: the model is useful as a compact surrogate, but its claimed predictive power is not independent of the fitting data.
Axiom & Free-Parameter Ledger
free parameters (1)
- twelve model parameters
axioms (1)
- domain assumption: Energy consumption can be additively decomposed into prefill and decode phases whose contributions are independent of each other once parallelism settings are fixed.
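Written out, the additive-decomposition assumption could take the following shape. The symbols are introduced here for illustration and are not the paper's notation: $t$ and $p$ are the tensor- and pipeline-parallel degrees, $B$ the batch size, and $S_{\text{in}}$, $S_{\text{out}}$ the prompt and generated sequence lengths. Assigning $S_{\text{in}}$ to prefill and $S_{\text{out}}$ to decode is itself an assumption.

```latex
E_{\text{total}}(t, p, B, S_{\text{in}}, S_{\text{out}})
  = E_{\text{prefill}}(t, p, B, S_{\text{in}})
  + E_{\text{decode}}(t, p, B, S_{\text{out}})
```

The assumption is the absence of any cross term: for fixed $t$ and $p$, no term couples the prefill arguments to the decode arguments, so each phase can be fitted and inspected on its own.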