pith. machine review for the scientific record.

arxiv: 2605.14249 · v1 · submitted 2026-05-14 · 💻 cs.LG

Recognition: no theorem link

EnergyLens: Predictive Energy-Aware Exploration for Multi-GPU LLM Inference Optimization

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:44 UTC · model grok-4.3

classification 💻 cs.LG
keywords energy prediction · multi-GPU inference · LLM optimization · MoE modeling · compute-communication overlap · energy-aware serving

The pith

EnergyLens predicts multi-GPU LLM inference energy with 9-13 percent error to identify efficient configurations without exhaustive profiling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EnergyLens as a framework that models energy consumption for large language model inference across multiple GPUs. Current tools either demand production-level code and costly measurements or overlook the energy effects of parallelism and communication in distributed settings. EnergyLens supplies an einsum-based interface to specify model details like fusion and overlap, paired with targeted models for mixture-of-experts load imbalance and inter-GPU communication costs. On Llama3 and Qwen3-MoE it reports mean absolute percentage errors between 9.25 and 13.19 percent for prefill and decode phases. The framework also surfaces large energy differences across setups and locates Pareto-optimal overlap points that intuition alone misses.

Core claim

EnergyLens is an end-to-end framework for energy-aware LLM inference optimization. It captures specifications including fusion, parallelism, and compute-communication overlap through an einsum-based interface, and augments these with load-imbalance-aware MoE modeling and an empirically driven communication energy model. On multi-GPU prefill and decode energy for Llama3 and Qwen3-MoE it delivers MAPEs of 9.25 to 13.19 percent while correctly recovering Pareto-optimal overlap configurations.

What carries the argument

Einsum-based interface for specifying LLM fusion, parallelism, and compute-communication overlap, together with load-imbalance-aware MoE modeling and an empirically driven multi-GPU communication energy model.
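
The interface is concrete in the paper's appendix listings, which specify attention as tagged einsum operations with an explicit parallel axis. A minimal reconstruction from the fragment visible in the extraction (symbols per the paper's Table 2: b batch size, K number of key/value heads, r query-to-KV-head ratio, s sequence length, h head dimension; z, the KV sequence length, is inferred from the equations; op is EnergyLens's operator constructor, not defined here):

    # Attention as two einsums sharded on the KV-head axis "K".
    attn_eqs = [
        op("bKrsh,bKzh->bKrsz", parallel="K", label="QK"),  # query-key scores
        op("bKrsz,bKzh->bKrsh", parallel="K", label="AV"),  # attention-value product
    ]

The parallel="K" tag is what lets the framework attribute compute and communication to a sharded axis; any constructor arguments beyond those shown in the fragment are not visible in the extraction.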

If this is right

  • Energy consumption varies by up to 1.47x in prefill and 52.9x in decode across different overlap and parallelism choices.
  • Compute-communication overlap strategies that appear optimal by intuition are often not Pareto-optimal, and the framework identifies the better ones (a minimal Pareto-filter sketch follows this list).
  • Distributed serving configurations become preferable once energy costs are quantified rather than guessed.
  • Practitioners can rank candidate optimizations and hardware allocations without running full production code or exhaustive profiling.
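
Not the paper's code, but the Pareto filter implied by its overlap exploration is small enough to state exactly. A minimal sketch in Python, with hypothetical (energy, latency) pairs per configuration:

    def pareto_front(points):
        # points: (energy_joules, latency_seconds, config_name) tuples.
        # Keep a point if no other point is at least as good on both
        # objectives and strictly better on one.
        pts = list(points)
        return sorted(
            (e, l, c) for e, l, c in pts
            if not any(e2 <= e and l2 <= l and (e2 < e or l2 < l)
                       for e2, l2, _ in pts)
        )

    # Hypothetical sweep: maximum overlap is fastest but not cheapest,
    # so the intuition "always maximize overlap" misses a Pareto point.
    sweep = [(900.0, 1.20, "no-overlap"), (870.0, 0.95, "partial-overlap"),
             (950.0, 0.90, "max-overlap")]
    print(pareto_front(sweep))  # partial-overlap and max-overlap survive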

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same modeling approach could be used to screen candidate parallelism schemes before any hardware is allocated.
  • Extending the empirical communication model to new interconnect technologies would immediately widen the set of configurations that can be compared.
  • Repeated application across successive model releases would accumulate a dataset for refining the communication energy equations without additional manual measurement.

Load-bearing premise

The empirically driven communication energy model and load-imbalance-aware MoE modeling generalize accurately to unseen multi-GPU configurations and model scales beyond the validation set.

What would settle it

Measure actual energy on a tensor-parallel or expert-parallel configuration, or a larger model scale, that was absent from the original validation runs, and check whether MAPE stays within the reported 9.25–13.19 percent band.
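
The settling experiment reduces to a MAPE computation over fresh measurements. A minimal sketch, assuming paired per-configuration energies in joules (all numbers hypothetical):

    def mape_percent(measured, predicted):
        # Mean absolute percentage error over paired readings.
        pairs = list(zip(measured, predicted))
        return 100.0 * sum(abs(p - m) / abs(m) for m, p in pairs) / len(pairs)

    # Hypothetical unseen TP/EP configuration: measured vs. predicted energy.
    measured_j = [412.0, 980.5, 1533.2]
    predicted_j = [455.0, 1012.0, 1391.0]
    print(f"MAPE: {mape_percent(measured_j, predicted_j):.2f}%")  # ~7.6% here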

Figures

Figures reproduced from arXiv: 2605.14249 by Anantha P. Chandrakasan, Eun Kyung Lee, Kyungmi Lee, Tamar Eilam, Xin Zhang, Zhiye Song.

Figure 1: EnergyLens framework enables energy-latency optimization in the high-dimensional …
Figure 2: MoE routing statistics showing the average load and the load of the bottleneck GPU.
Figure 3: 8-GPU ReduceScatter is slower when fewer streaming multiprocessors (SMs) are dedicated to communication. We propose an empirically-driven and overlap-aware communication energy model. Unlike prior work that assumes fixed bandwidth utilization and only estimates latency (Lee et al., 2025b), we profile latency and energy consumption across various communication kernels and transfer sizes and use interpola…
Figure 4: EnergyLens enables intuitive specification of parallelism and overlap settings.
Figure 5: Llama3-70B prefill phase at B2, ISL=4096 across tensor parallelism configurations (TP2, …
Figure 6: Qwen3-30B-A3B decode phase at TP2 in attention layers and EP2 in MoE layers, with …
Figure 7: Energy-latency trade-offs for Llama3-70B. Configurations exceeding GPU memory are …
Figure 8: Llama3-70B (ISL=4096): the predicted Pareto-optimal points fall on the Pareto front of 55 measured configurations with varying overlap settings, batch size, and tensor parallelism. Overlap introduces another optimization dimension beyond tensor parallelism and batch size. Instead of relying on intuition to prioritize maximum overlap, EnergyLens predicts the full configuration space and identifies th…
Figure 9: Observed GPU power consumption varies by up to 60% in the decode phase, not captured by TDP. Llama3-70B inference on different numbers of A100-80GB GPUs (TP2, TP4, and TP8) and batch sizes (B2, B8 and B32).
Figure 10: Total energy and latency of AllReduce for 2, 4, and 8 GPUs. Overhead dominates at small workload sizes, resulting in higher latency and energy per bit. This behavior is missed by bandwidth-and-queuing-based models used in prior works (Lee et al., 2025b; Li et al., 2023).
Figure 11: The energy to first token (ETFT) prediction is validated on single-GPU Llama3-8B …
Figure 12: The energy per request prediction is validated on single-GPU Llama3-8B inferences with …
Figure 13: Energy and latency breakdown for Llama3-70B decode phase at TP4. (a) Energy …
Figure 14: Qwen3-30B-A3B prefill phase at B2, ISL=4096 across …
Figure 15: EnergyLens allows developers to rapidly assess unfused and fused implementations.
Figure 16: EnergyLens enables fast evaluation of kernel fusion.
Original abstract

We present EnergyLens, an end-to-end framework for energy-aware large language model (LLM) inference optimization. As LLMs scale, predicting and reducing their energy footprint has become critical for sustainability and datacenter operations, yet existing approaches either require production-level code and expensive profiling or fail to accurately capture multi-GPU energy behavior. As a result, practitioners lack tools for deciding which optimizations to prioritize and for selecting among existing deployment configurations when exhaustive profiling is impractical. EnergyLens addresses this gap with an intuitive einsum-based interface that captures LLM specifications including fusion, parallelism, and compute-communication overlap, combined with load-imbalance-aware MoE modeling and an empirically driven communication energy model for multi-GPU settings. We validate EnergyLens on Llama3 and Qwen3-MoE across tensor-parallel and expert-parallel configurations, achieving mean absolute percentage errors (MAPEs) between 9.25% and 13.19% for multi-GPU prefill and decode energy, and 12.97% across SM allocations for Megatron-style overlap. Our energy-driven exploration reveals up to 1.47x and 52.9x energy variation across configurations in prefill and decode efficiency and motivates distributed serving. We further show that compute-communication overlap is difficult to optimize with intuition alone, but EnergyLens correctly identifies Pareto-optimal overlap configurations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. EnergyLens is an end-to-end framework for energy-aware LLM inference optimization on multi-GPU systems. It provides an einsum-based interface to specify LLM computations (including fusion, tensor/expert parallelism, and compute-communication overlap), augments this with a load-imbalance-aware MoE energy model and an empirically driven multi-GPU communication energy model, and uses the resulting predictor to explore energy-efficient configurations. On Llama3 and Qwen3-MoE the framework reports MAPEs of 9.25–13.19% for prefill and decode energy across tensor-parallel and expert-parallel settings, plus 12.97% MAPE across SM allocations for Megatron-style overlap; it also claims to identify Pareto-optimal overlap points and to reveal up to 1.47× and 52.9× energy variation across configurations.

Significance. If the predictive models generalize, the work supplies a practical, low-overhead tool for ranking deployment choices and overlap strategies without exhaustive hardware profiling, directly addressing the sustainability and operational cost of large-scale LLM serving.

major comments (3)
  1. [§5] §5 (Evaluation): The reported MAPE ranges (9.25–13.19%) are presented without error bars, without a description of the data-exclusion policy, and without any statement of whether the communication-energy-model coefficients were fitted on the same traces later used for validation; this leaves open the possibility that the quoted accuracy partly reflects in-sample fit rather than out-of-sample prediction.
  2. [§4.2] §4.2 (Communication model): The text states that the multi-GPU communication energy model is “empirically driven” yet supplies no feature set, fitting procedure, regularization, or cross-validation protocol; without these details the claim that EnergyLens “correctly identifies Pareto-optimal overlap configurations” cannot be verified, because relative errors of 10–13% could still invert the ranking of candidate points. (A toy simulation of this ranking risk follows the major comments.)
  3. [§4.3] §4.3 (MoE modeling): The load-imbalance-aware MoE component is likewise described as empirically driven, but no quantitative definition of imbalance, no validation across expert-parallel degrees, and no held-out model-scale experiments are provided; this is load-bearing for the generalization claim to “unseen multi-GPU configurations.”
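
To make the ranking concern in major comment 2 concrete, a toy Monte Carlo sketch (all numbers hypothetical, not drawn from the paper): two configurations whose true energies differ by 10%, each predicted with independent ±12% multiplicative error, get mis-ranked in a non-negligible fraction of trials.

    import random

    random.seed(0)
    true_a, true_b = 1000.0, 1100.0  # hypothetical true energies (J); A is better
    trials, flips = 100_000, 0
    for _ in range(trials):
        pred_a = true_a * random.uniform(0.88, 1.12)  # +/-12% error band
        pred_b = true_b * random.uniform(0.88, 1.12)
        if pred_a > pred_b:  # predictor ranks B ahead of A
            flips += 1
    print(f"ranking inverted in {100 * flips / trials:.1f}% of trials")
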
minor comments (2)
  1. [Abstract] Abstract: the three MAPE numbers are given as a single range without mapping each value to a concrete model/phase/parallelism combination, reducing immediate readability.
  2. [§5] Figure captions and §5: several plots lack axis labels for energy units or explicit legend entries for the different overlap strategies being compared.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for improving the transparency and rigor of our modeling and evaluation sections. We address each major comment below and will revise the manuscript to incorporate the requested details; a combined sketch of the promised protocols follows the point-by-point responses.

Point-by-point responses
  1. Referee: [§5] §5 (Evaluation): The reported MAPE ranges (9.25–13.19%) are presented without error bars, without a description of the data-exclusion policy, and without any statement of whether the communication-energy-model coefficients were fitted on the same traces later used for validation; this leaves open the possibility that the quoted accuracy partly reflects in-sample fit rather than out-of-sample prediction.

    Authors: We agree that §5 lacks error bars, an explicit data-exclusion policy, and confirmation of out-of-sample validation. In the revised manuscript we will add error bars computed across repeated runs, describe the trace-splitting procedure (separate fitting and validation sets), and state that the reported MAPEs reflect held-out evaluation. These additions will directly address the concern about in-sample fit. revision: yes

  2. Referee: [§4.2] §4.2 (Communication model): The text states that the multi-GPU communication energy model is “empirically driven” yet supplies no feature set, fitting procedure, regularization, or cross-validation protocol; without these details the claim that EnergyLens “correctly identifies Pareto-optimal overlap configurations” cannot be verified, because relative errors of 10–13% could still invert the ranking of candidate points.

    Authors: The referee is correct that the communication-model description in §4.2 is incomplete. We will expand this section to specify the feature set (message size, GPU count, bandwidth), the fitting procedure (regularized linear regression), and the cross-validation protocol. With these details readers can evaluate whether the 10–13 % error is sufficient to preserve the reported Pareto ranking. revision: yes

  3. Referee: [§4.3] §4.3 (MoE modeling): The load-imbalance-aware MoE component is likewise described as empirically driven, but no quantitative definition of imbalance, no validation across expert-parallel degrees, and no held-out model-scale experiments are provided; this is load-bearing for the generalization claim to “unseen multi-GPU configurations.”

    Authors: We acknowledge that §4.3 requires additional rigor. In the revision we will supply a quantitative definition of load imbalance (variance in expert utilization), report validation results across multiple expert-parallel degrees, and include held-out experiments at different model scales to support the generalization claim. revision: yes
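
A combined sketch of the three promised protocols, under loudly labeled assumptions (names, features, and numbers are illustrative, not the paper's): a seeded trace split for held-out evaluation, a log-interpolated communication energy table keyed by kernel and GPU count, and a bottleneck-over-mean MoE imbalance factor of the kind Figure 2 plots.

    import bisect, math, random

    def split_traces(traces, holdout_frac=0.3, seed=0):
        # Response to point 1: separate fitting and validation sets.
        shuffled = traces[:]
        random.Random(seed).shuffle(shuffled)
        cut = int(len(shuffled) * (1 - holdout_frac))
        return shuffled[:cut], shuffled[cut:]

    class CommEnergyTable:
        # Response to point 2: empirical (bytes -> joules) tables per
        # (kernel, gpu_count), log-linear interpolation between profiled sizes.
        def __init__(self, fit_traces):
            self.tables = {}
            for kernel, gpus, nbytes, joules in fit_traces:
                self.tables.setdefault((kernel, gpus), []).append((nbytes, joules))
            for table in self.tables.values():
                table.sort()

        def predict(self, kernel, gpus, nbytes):
            table = self.tables[(kernel, gpus)]
            sizes = [s for s, _ in table]
            i = bisect.bisect_left(sizes, nbytes)
            if i == 0:
                return table[0][1]   # below profiled range: overhead floor
            if i == len(table):
                return table[-1][1]  # above profiled range: clamp
            (s0, e0), (s1, e1) = table[i - 1], table[i]
            t = (math.log(nbytes) - math.log(s0)) / (math.log(s1) - math.log(s0))
            return e0 + t * (e1 - e0)

    def moe_bottleneck_factor(tokens_per_gpu):
        # Response to point 3: imbalance as worst-GPU load over mean load.
        return max(tokens_per_gpu) / (sum(tokens_per_gpu) / len(tokens_per_gpu))

    # Held-out check on synthetic AllReduce traces: fit on one subset,
    # report MAPE only on the rest.
    traces = [("allreduce", 8, 2 ** k, 0.2 + 0.15 * 2 ** (k - 20))
              for k in range(20, 29)]
    fit, val = split_traces(traces)
    model = CommEnergyTable(fit)
    errs = [abs(model.predict(k, g, n) - e) / e for k, g, n, e in val]
    print(f"held-out MAPE: {100 * sum(errs) / len(errs):.2f}%")
    print(f"EP4 bottleneck factor: {moe_bottleneck_factor([1024, 980, 1010, 1490]):.2f}x")

Log-space interpolation matches the decades-wide transfer-size sweeps in Figures 3 and 10; interpolating in raw bytes would overweight the largest profiled sizes.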

Circularity Check

1 step flagged

Empirically fitted communication and MoE energy models reduce reported MAPEs to in-sample fit quality rather than independent prediction

specific steps
  1. fitted input called prediction [Abstract]
    "combined with load-imbalance-aware MoE modeling and an empirically driven communication energy model for multi-GPU settings. We validate EnergyLens on Llama3 and Qwen3-MoE across tensor-parallel and expert-parallel configurations, achieving mean absolute percentage errors (MAPEs) between 9.25% and 13.19% for multi-GPU prefill and decode energy"

    The communication energy model and MoE model are described as empirically driven (i.e., parameters fitted to profiled data). Validation is performed on the identical Llama3 and Qwen3-MoE tensor- and expert-parallel configurations, so the reported MAPE quantifies how well the fitted parameters reproduce the training measurements rather than predicting unseen hardware or model scales.

full rationale

The paper's accuracy claims (MAPE 9.25–13.19%) and Pareto-identification rest on two components explicitly labeled 'empirically driven' and 'load-imbalance-aware'. These are fitted to measurements on the exact Llama3/Qwen3-MoE tensor- and expert-parallel setups used for validation. No held-out configurations, regularization details, or out-of-sample protocol are supplied in the abstract or validation description, so the low errors are consistent with fitting rather than generalization. This matches the 'fitted input called prediction' pattern and raises the circularity score to 6; the remainder of the framework (einsum interface, overlap exploration) is not shown to be circular.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Framework rests on domain assumptions about the fidelity of the einsum abstraction and on empirical fitting for communication costs; no new physical entities are postulated.

free parameters (1)
  • communication energy model coefficients
    Empirically driven model requires fitted parameters from hardware measurements.
axioms (1)
  • domain assumption: The einsum-based interface fully captures fusion, parallelism, and compute-communication overlap behavior in LLM inference.
    Central modeling choice stated in the abstract.

pith-pipeline@v0.9.0 · 5559 in / 1262 out tokens · 25470 ms · 2026-05-15T02:44:12.207532+00:00 · methodology


Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 5 internal anchors

  1. Jaehong Cho, Minsu Kim, Hyunmin Choi, Guseul Heo, and Jongse Park. LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale. In 2024 IEEE International Symposium on Workload Characterization (IISWC), pages 15–29, September 2024. doi: 10.1109/IISWC63097.2024.00012. arXiv:2408.05499 [cs].

  2. Jaehong Cho, Hyunmin Choi, Guseul Heo, and Jongse Park. LLMServingSim 2.0: A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure, March. arXiv:2602.23036 [cs].

  3. Ahmad Faiz, Sotaro Kaneda, Ruhan Wang, Rita Osi, Parteek Sharma, Fan Chen, and Lei Jiang. LLMCarbon: Modeling the End-to-End Carbon Footprint of Large Language Models, pages 1–15. arXiv:2309.14393.

  4. Zhenxiao Fu, Fan Chen, Shan Zhou, Haitong Li, and Lei Jiang. LLMCO2: Advancing Accurate Carbon Footprint Prediction for LLM Inferences. ACM SIGENERGY Energy Informatics Review, 5(2).

  5. Ke Hong, Xiuhong Li, Minxu Liu, Qiuli Mao, Tianqi Wu, Zixiao Huang, Lufang Chen, Zhong Wang, Yichong Zhang, Zhenhua Zhu, Guohao Dai, and Yu Wang. Efficient and Adaptable Overlapping for Computation and Communication via Signaling and Reordering, October.

  6. Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Leon Song, Samyam Rajbhandari, and Yuxiong He. DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models.

  7. Vijay Kandiah, Scott Peverelle, Mahmoud Khairy, Junrui Pan, Amogh Manjunath, Ti… doi: 10.1145/3466752.3480063.

  8. Alexandre Lacoste and Thomas Dandres. Quantifying the Carbon Emissions of Machine Learning. arXiv:1910.09700.

  9. Kyungmi Lee, Zhiye Song, Eun Kyung Lee, Xin Zhang, Tamar Eilam, and Anantha P. Chandrakasan. EnergAIzer: Fast and Accurate GPU Power Estimation Framework for AI Workloads. IEEE International Symposium on Performance Analysis of Systems and Software.

  10. Seonho Lee, Jihwan Oh, Junkyum Kim, Seokjin Go, Jongse Park, and Divya Mahajan. Characterizing Compute-Communication Overlap in GPU-Accelerated Distributed Deep Learning: Performance and Power Implications, July 2025. arXiv:2507.03114 [cs].

  11. Seonho Lee, Amar Phanishayee, and Divya Mahajan. Forecasting GPU Performance…

  12. Ying Li, Yuhui Bao, Gongyu Wang, Xinxin Mei, Pranav Vaid, Anandaroop Ghosh, Adwait Jog, Darius Bunandar, Ajay Joshi, and Yifan Sun. TrioSim: A Lightweight Simulator for Large-Scale DNN Workloads on Multi-GPU Systems. In Proceedings of the 52nd …

  13. Mingyu Liang, Hiwot Tadese Kassa, Wenyin Fu, Brian Coutinho, Louis Feng, and Christina Delimitrou. Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training. In MLSys Conference.

  14. Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053.

  15. Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and Policy Considerations for Modern Deep Learning Research. Proceedings of the AAAI Conference on Artificial Intelligence, 34(09):13693–13696. doi: 10.1609/aaai.v34i09.7123.

  16. Arya Tschand, Arun Tejusve Raghunath Rajan, Sachin Idgunji, Anirban Ghosh, Jeremy Holleman, Csaba Kiraly, Pawan Ambalkar, Ritika Borkar, Ramesh Chukka, Trevor Cockrell, Oliver Curtis, Grigori Fursin, Miro Hodak, Hiwot Kassa, Anton Lokhmotov, Dej… arXiv:2410.12032.

  17. William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna. ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale. Proceedings of the 2023 IEEE International Symposium on Performance Analysis of Systems and Software. doi: 10.1109/ISPASS57527.2023.00035. arXiv:2303.14006.

  18. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Y… Qwen3 Technical Report. arXiv:2505.09388 [cs].

  19. Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation, June.

  18. [18]

    bKrsh,bKzh->bKrsz

    1attn_eqs = [ 2op("bKrsh,bKzh->bKrsz", parallel="K", label="QK"), 3op("bKrsz,bKzh->bKrsh", parallel="K", label="AV") 4] 14 Table 2: LLM symbols used in Listings 1 to 3 Symbol Description rQ head to KV head ratio in grouped query attention bBatch size sInput length in prefill; 1 in decode hHead dimension HNumber of query heads KNumber of key/value heads tB...

  19. [19]

    Figure 11 plots the predicted ETFT, normalized per request

    We include all observed ISL-batch-size pairs in the MAPE calculation. Figure 11 plots the predicted ETFT, normalized per request. Since these operations already have high arithmetic intensity, ETFT is largely insensitive to batch size. EnergyLens closely matches measurements, achieving a MAPE of 11.31%. Decode behaves very differently from prefill. Since ...

  20. [20]

    bKrsh,bKzh->bKrsz

    An example specification of the fused dense transformer with CP is provided below. We validated EnergyLens’s support for context parallelism (the variety proposed by DeepSpeed- Ulysses) on Llama3-8B. This is tested on CP2 with the same sweep settings described in Appendix M, achieving MAPEs of 14.69% and 12.58% for energy and latency, respectively. The co...

  21. [21]

    wikipedia

    25.45% Li et al. (2023) 210.59% NeuSight (Lee et al., 2025b) 25.69% In the decode phase of LLM inference, GEMM kernels exhibit low arithmetic intensity and skewed matrix shapes that challenge existing kernel latency estimation tools (Li et al., 2023; Lee et al., 2025b, 2026). To assess whether this limitation stems from our default backend, we leverage En...

  22. [22]

    New" or

    Actual runtime batch sizes used by TensorRT-LLM at long contexts were verified with Torch Profiler, and all observed batch-size/sequence-length pairs were included in the MAPE calculation. The Llama3-70B overlap MAPE results were obtained with Megatron-style compute-communication overlap in the prefill phase. Overlap configurations including no overlap an...