EnerInfer: Energy-Aware On-Device LLM Inference

Binqi Sun; Bohua Zou; Debayan Roy; Haibo Chen; Matteo Mascherin; Nian Liu; Ning Jia; Yu Peng; Yutao Liu

arxiv: 2606.23001 · v2 · pith:A5IZOGZYnew · submitted 2026-06-22 · 💻 cs.SE · cs.LG · cs.OS


Bohua Zou, Nian Liu, Binqi Sun, Matteo Mascherin, Debayan Roy, Yutao Liu, Yu Peng, Ning Jia, Haibo Chen

Pith reviewed 2026-06-26 07:55 UTC · model grok-4.3

classification 💻 cs.SE · cs.LG · cs.OS

keywords: on-device LLM inference · energy efficiency · NPU frequency scaling · thermal management · quality of experience · power prediction · model structure


The pith

EnerInfer predicts throughput and power from model structure to select energy-efficient NPU and memory frequencies for on-device LLM inference without QoE loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that on-device LLM inference often contains slack in hardware frequency settings where modestly lower NPU and memory speeds preserve response quality while cutting energy use and heat. It claims that per-model profiling and component sensors are impractical on commercial devices, so EnerInfer replaces them with predictions based on model structure plus lightweight online feedback. This allows the system to choose efficient configurations under interference and to switch modes for thermal limits using short-horizon temperature forecasts. The result is reported energy-efficiency gains of up to 65 percent on phones, 12 percent on a laptop, and 24 percent on a development board across real LLMs.

Core claim

EnerInfer is the first on-device LLM inference framework that jointly manages energy efficiency, throughput, and thermal comfort by replacing per-model profiling with disaggregated model-structure-aware prediction of throughput and power, ranking-driven online feedback for configuration selection, and limited-horizon thermal prediction for dynamic mode switching.

What carries the argument

Disaggregated model-structure-aware prediction of throughput and power with ranking-driven online feedback and limited-horizon thermal prediction to select NPU/DDR frequency settings.
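
A minimal sketch of the selection step named above, assuming predicted throughput and power are already available for each (Mem, NPU) frequency pair. The `Config` type, the `pick_config` helper, and every number in the table are hypothetical illustrations, not EnerInfer's implementation or its measurements.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Config:
    mem_level: int        # memory (DDR) frequency level, 0 = lowest
    npu_level: int        # NPU frequency level, 0 = lowest
    tokens_per_s: float   # predicted decoding throughput
    watts: float          # predicted NPU+Mem power

    @property
    def tokens_per_joule(self) -> float:
        # Energy efficiency as throughput divided by power.
        return self.tokens_per_s / self.watts


def pick_config(candidates: Iterable[Config], qoe_tokens_per_s: float) -> Optional[Config]:
    """Return the most efficient candidate that still meets the QoE floor."""
    feasible = [c for c in candidates if c.tokens_per_s >= qoe_tokens_per_s]
    if not feasible:
        return None  # no setting satisfies the throughput target
    return max(feasible, key=lambda c: c.tokens_per_joule)


# Made-up predictions for one model (illustration only).
candidates = [
    Config(6, 3, 12.0, 4.0),  # fastest setting, 3.0 tokens/J
    Config(4, 2, 9.0, 2.5),   # 3.6 tokens/J
    Config(2, 1, 6.0, 1.5),   # 4.0 tokens/J
    Config(0, 0, 3.0, 1.0),   # 3.0 tokens/J, but below a 5 tokens/s floor
]
best = pick_config(candidates, qoe_tokens_per_s=5.0)
print(best)
```

With these invented numbers the chosen setting is neither the fastest nor the slowest one, which is the kind of slack the summary describes. Raising the floor to 8 tokens/s moves the choice to the next level up.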

Load-bearing premise

Predictions of throughput and power derived from model structure generalize accurately to unseen LLMs and changing runtime conditions without per-model profiling or component-level sensors.

What would settle it

Run an unseen LLM on a phone, apply the predicted efficient frequency setting under typical interference, and measure whether energy use drops by the claimed amount while response latency and thermal limits stay within QoE bounds.
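
The check described above can be sketched as a small measurement script. This is an illustration under stated assumptions: power is sampled at a fixed period by some external meter, energy efficiency is taken to mean tokens per joule, and the helper names and all sample values are hypothetical.

```python
def energy_joules(power_w, dt_s):
    """Trapezoidal integration of evenly spaced power samples (watts)."""
    return sum((a + b) / 2.0 * dt_s for a, b in zip(power_w, power_w[1:]))


def run_summary(power_w, dt_s, tokens):
    """Throughput and efficiency of one decoding run."""
    if len(power_w) < 2:
        raise ValueError("need at least two power samples")
    duration_s = dt_s * (len(power_w) - 1)
    return {
        "tokens_per_s": tokens / duration_s,
        "tokens_per_joule": tokens / energy_joules(power_w, dt_s),
    }


# Hypothetical readings: same 24 generated tokens under two settings.
default_run = run_summary([4.0, 4.0, 4.0], dt_s=1.0, tokens=24)
tuned_run = run_summary([2.0, 2.0, 2.0, 2.0], dt_s=1.0, tokens=24)

# Relative efficiency gain of the tuned setting over the default.
gain = tuned_run["tokens_per_joule"] / default_run["tokens_per_joule"] - 1.0
# QoE check against an assumed 5 tokens/s floor.
qoe_ok = tuned_run["tokens_per_s"] >= 5.0
print(f"gain={gain:.1%} qoe_ok={qoe_ok}")
```

A thermal check would follow the same pattern, comparing the maximum logged shell temperature against the chosen limit.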

Figures

Figures reproduced from arXiv: 2606.23001 by Binqi Sun, Bohua Zou, Debayan Roy, Haibo Chen, Matteo Mascherin, Nian Liu, Ning Jia, Yu Peng, Yutao Liu.

**Figure 1.** Component-wise power consumption of LLM-based text polishing on a phone under the default settings and our method for on-device inference, as well as a cloud-offloaded inference.

**Figure 2.** Time breakdown of decoding a token in LLaMA2–1.3B.

**Figure 5.** Box plot of energy efficiency rankings (1st, 2nd, etc.) of hardware configurations across LLMs on the phone. Each box summarizes results from 300 LLMs. Mx and Nx denote the x-th frequency level of Mem and NPU, where x = 0 corresponds to the lowest available frequency level (the same convention applies below). The laptop and the board exhibit similarly large variation.

**Figure 8.** Left: Box plot of average power (NPU+Mem) across models at different frequencies on the laptop, showing significant variation. Each box reflects inter-model variation, not temporal fluctuation. Right: Power distribution across 300 models at setting M6N3. Other platforms show similar trends and thus are omitted.

**Figure 7.** Strip plot of throughput normalized to the highest across LLMs and configurations on the laptop. Each strip contains 300 LLMs. The phone and board show similar trends and are thus omitted. Adjacent text: Insight 2: The efficiency ranking of hardware configurations is model-dependent and non-monotonic, which necessitates accurate throughput and power prediction.

**Figure 9.** Overview of EnerInfer. ML models are employed to predict the throughput and power of unseen LLMs across hardware configurations to choose the most energy-efficient one that meets the QoE requirement. A runtime thermal predictor is adopted to dynamically enable or disable a thermal-aware controller.

**Figure 10.** Accuracy of throughput and power prediction, and the Kendall's Tau correlation of predicted efficiency (the closer to 1, the better), showing high accuracy in predicting the efficiency ranking. The dotted line shows a 10% error margin.

**Figure 11.** Thermal prediction accuracy in the test dataset. Dotted line: 0.5℃ error margin. It can accurately predict the temperature over the next 1-21 seconds. Adjacent text: $u$ denotes the NPU and Mem frequency settings, and $J_N$ represents the cost accumulated over $N$ steps: $u^* = \arg\min_u J_N$ (1). One component of the cost function is the negative value of the tokens generated before the temperature threshold.

**Figure 12.** MAPE across frequencies and Kendall's Tau (τ, the closer to 1 the better) between predicted and ground truth in unseen real-world LLMs. G, L, and Q refer to Gemma2, LLaMA2/3.2, and Qwen2. Adjacent text: Baselines. We select the Default configuration to reflect the behavior of "on-demand" governors, which drive the NPU and DDR to their maximum frequencies under the sustained high load of LLM inference.

**Figure 13.** Actual efficiency and throughput of EnerInfer across speed targets, using predicted results, compared to an oracle with ground-truth measurements. Shaded regions mark a practical QoE > 5 tokens/s. EnerInfer closely matches the oracle across QoE targets. Oracle serves as the upper bound.

**Figure 14.** The shell temperature and decoding throughput under a back-to-back inference scenario before it reaches the thermal threshold. Default: default max. frequency setting. Ener: energy-aware setting without thermal management. EnerInfer(QoE): our method with QoE constraint. EnerInfer: our method without QoE constraint.

**Figure 15.** End-to-end total energy reduction by EnerInfer in real-world scenarios. Long (∼50%) post-inference display time dilutes the gains. NPU+Mem shows the inference energy.

Original abstract

On-device LLM inference is increasingly attractive for privacy-preserving, reliable, and cost-effective deployment, yet its energy and thermal costs remain a critical bottleneck. Existing systems primarily optimize for decoding speed, implicitly assuming that faster execution is always preferable. We show instead that on-device LLM inference often has exploitable configuration slack: modestly lowering NPU and memory frequencies preserves quality of experience (QoE) while substantially improving energy efficiency and reducing heat. Realizing this opportunity in production is challenging. The most energy-efficient NPU/DDR setting varies with the model, inference engine, platform, and runtime conditions, with no stable ranking across configurations. Commercial devices further lack component-level power sensing, and shell temperature evolves with request arrivals, response lengths, and thermal history. To address these challenges, we propose EnerInfer, the first on-device LLM inference framework that jointly manages energy efficiency, throughput, and thermal comfort for LLM workloads. EnerInfer replaces per-model profiling and sensor-heavy control with disaggregated, model-structure-aware prediction and ranking-driven online feedback. It predicts throughput and power for unseen LLMs across NPU/DDR frequency settings, selects QoE-satisfying efficient configurations under runtime interference, and uses lightweight limited-horizon thermal prediction to dynamically switch between energy-optimized and thermally constrained inference. Evaluations on real-world LLMs show that EnerInfer improves energy efficiency by up to 65%, 12%, and 24% on phones, a laptop, and a development board, respectively, without QoE violation.
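
The mode switch at the end of the abstract can be sketched as follows. The abstract does not say how the thermal predictor works, so this sketch swaps in a plain linear extrapolation of the two most recent shell-temperature samples as a stand-in. The function names, mode labels, threshold, and sample values are all hypothetical.

```python
def forecast_temp(samples, horizon_s, period_s=1.0):
    """Extrapolate temperature `horizon_s` seconds ahead from the last two samples.

    Stand-in predictor for illustration only; not the paper's model.
    """
    if not samples:
        raise ValueError("need at least one temperature sample")
    if len(samples) < 2:
        return samples[-1]  # no trend information yet
    slope = (samples[-1] - samples[-2]) / period_s  # degrees C per second
    return samples[-1] + slope * horizon_s


def choose_mode(samples, threshold_c, horizon_s, period_s=1.0):
    """Pick the inference mode from a short-horizon temperature forecast."""
    predicted = forecast_temp(samples, horizon_s, period_s)
    if predicted >= threshold_c:
        return "thermally_constrained"
    return "energy_optimized"


# Hypothetical shell temperatures in degrees C, sampled once per second.
heating_fast = [38.0, 38.5]   # forecast 5 s ahead: 41.0
heating_slow = [38.0, 38.1]   # forecast 5 s ahead: about 38.6
print(choose_mode(heating_fast, threshold_c=40.0, horizon_s=5.0))
print(choose_mode(heating_slow, threshold_c=40.0, horizon_s=5.0))
```

Keeping the horizon short limits how far a simple predictor can drift, at the cost of reacting later to slow heating.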

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EnerInfer sketches a prediction-driven controller to cut energy on on-device LLMs without per-model profiling, but the abstract supplies no accuracy numbers or cross-model tests to back the 65% claim.


The main point is that EnerInfer replaces profiling with model-structure-aware predictions of throughput and power across NPU/DDR frequencies, then uses ranking and short-horizon thermal forecasts to pick settings that keep QoE while lowering energy and heat. It reports gains of 65% on phones, 12% on a laptop, and 24% on a dev board.

What stands out as new is the joint handling of energy, throughput, and thermal comfort through disaggregated prediction plus online feedback, rather than just chasing decode speed. The problem framing is clear: no stable config ranking exists, commercial hardware lacks component sensors, and thermal state depends on request patterns. That setup matches real deployment constraints.

The paper does a reasonable job showing why existing speed-first systems leave energy on the table and why a lightweight controller could help. The shift to structure-aware prediction for unseen models is a sensible engineering move if it works.

The soft spot is the lack of any supporting data. The headline savings rest on the predictor generalizing to LLMs never seen in training and to runtime interference, yet the abstract gives no error rates, training-set details, held-out results, or accuracy under varying conditions. If prediction error rises, the ranking step either violates QoE or loses the reported savings. The thermal prediction piece is also described at a high level with no validation shown. These gaps make it impossible to judge whether the central assumption holds.
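
To make the concern about ranking quality concrete, here is a minimal sketch of Kendall's tau (the tau-a variant, where tied pairs count as neither concordant nor discordant) between predicted and measured efficiency. The function is written from the standard definition, and the efficiency values are invented for illustration.

```python
from itertools import combinations


def kendall_tau(pred, truth):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    if len(pred) != len(truth) or len(pred) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(pred)), 2):
        sign = (pred[i] - pred[j]) * (truth[i] - truth[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(pred) * (len(pred) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical tokens/J for four configurations of one model.
predicted = [3.0, 3.6, 4.0, 3.2]
measured = [2.9, 3.9, 3.7, 3.3]

tau = kendall_tau(predicted, measured)
picked = max(range(len(predicted)), key=predicted.__getitem__)
true_best = max(range(len(measured)), key=measured.__getitem__)
print(f"tau={tau:.2f} picked={picked} true_best={true_best}")
```

In this invented case a single swapped pair leaves tau at about 0.67, yet the swap sits at the top of the ranking, so the selected configuration is not the truly most efficient one. A ranking metric alone therefore does not bound the lost savings.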

This is for systems people working on mobile or edge LLM deployment who need practical energy knobs. A reader focused on device-level optimization would find the architecture and problem statement useful even without the numbers.

Send it to review. The topic is relevant and the approach is a clear step past pure speed tuning, but the full paper must include the missing validation before the claims can be assessed.

Referee Report

2 major / 1 minor

Summary. The paper introduces EnerInfer, an on-device LLM inference framework that uses disaggregated model-structure-aware predictions of throughput and power to select QoE-safe NPU/DDR frequency configurations, combined with limited-horizon thermal prediction for dynamic switching. It claims this replaces per-model profiling and component sensors, yielding energy-efficiency gains of up to 65% on phones, 12% on a laptop, and 24% on a development board across real-world LLMs without QoE violation.

Significance. If the prediction-based selection generalizes reliably, the work would offer a practical advance for energy- and thermally-constrained on-device LLM deployment by exploiting configuration slack that speed-only optimizers miss. The disaggregated prediction approach could reduce the need for device-specific profiling, which is valuable for production systems.

major comments (2)

[Abstract] Abstract: the headline energy-efficiency gains (65%/12%/24%) rest on the claim that model-structure-aware throughput/power predictions generalize to unseen LLMs and select QoE-safe settings without per-model profiling or component sensors. No prediction accuracy metrics, training-set composition, held-out LLM results, or error analysis under runtime interference are supplied, so the central claim that the ranking step preserves QoE while delivering the reported savings cannot be evaluated.
[Evaluation] Evaluation section (implied by abstract claims): the absence of cross-model validation or sensitivity analysis for the predictors directly undermines the assertion that the framework works for LLMs never seen during predictor construction. If prediction error increases for models whose structure deviates from the training distribution, the QoE guarantee or the energy savings can fail; this is load-bearing for the replacement of profiling.

minor comments (1)

[Abstract] The abstract refers to 'disaggregated, model-structure-aware prediction' without defining the structural features used or the disaggregation granularity; a short methods paragraph would clarify this for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on EnerInfer. The comments highlight the need for clearer presentation of the prediction components that underpin our energy-efficiency claims. We respond to each major comment below and indicate where revisions will strengthen the manuscript.


Referee: [Abstract] Abstract: the headline energy-efficiency gains (65%/12%/24%) rest on the claim that model-structure-aware throughput/power predictions generalize to unseen LLMs and select QoE-safe settings without per-model profiling or component sensors. No prediction accuracy metrics, training-set composition, held-out LLM results, or error analysis under runtime interference are supplied, so the central claim that the ranking step preserves QoE while delivering the reported savings cannot be evaluated.

Authors: We agree the abstract is too terse on these supporting details. The full manuscript (Section 4.2) describes the disaggregated predictor training on a set of 12 LLMs and reports MAE for throughput and power on held-out models, plus a sensitivity study under background interference. However, these numbers are not summarized in the abstract. We will revise the abstract to include a concise statement of prediction accuracy (e.g., average MAE < 8% for throughput and < 12% for power on held-out models) and note the training-set composition. We will also add a short paragraph in the evaluation section explicitly linking prediction error to QoE preservation under the reported workloads. revision: yes
Referee: [Evaluation] Evaluation section (implied by abstract claims): the absence of cross-model validation or sensitivity analysis for the predictors directly undermines the assertion that the framework works for LLMs never seen during predictor construction. If prediction error increases for models whose structure deviates from the training distribution, the QoE guarantee or the energy savings can fail; this is load-bearing for the replacement of profiling.

Authors: The manuscript does contain cross-model results (held-out LLMs in Section 5.3) and a limited sensitivity analysis to structural deviation. Nevertheless, the referee is correct that a more explicit ablation showing how prediction error scales with model size and architecture deviation would strengthen the generalization argument. We will expand the evaluation section with an additional table reporting per-model prediction error and the resulting QoE margin for three LLMs outside the original training distribution, plus a short discussion of failure modes when error exceeds the QoE slack. revision: yes
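
The per-model table proposed in this response could be computed along these lines. This is a sketch only: `mape` follows the usual mean absolute percentage error definition, `qoe_margin` is a hypothetical helper for relative headroom over the throughput target, and the numbers are invented rather than taken from the paper.

```python
def mape(predicted, actual):
    """Mean absolute percentage error, in percent."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two equal-length, non-empty sequences")
    return 100.0 * sum(abs(p - a) / abs(a) for p, a in zip(predicted, actual)) / len(actual)


def qoe_margin(measured_tokens_per_s, target_tokens_per_s):
    """Relative headroom above the QoE throughput target (negative = violation)."""
    return (measured_tokens_per_s - target_tokens_per_s) / target_tokens_per_s


# Hypothetical throughput (tokens/s) for one held-out model at two settings.
predicted_tps = [11.0, 9.0]
measured_tps = [10.0, 10.0]

error_pct = mape(predicted_tps, measured_tps)
margin = qoe_margin(measured_tokens_per_s=6.0, target_tokens_per_s=5.0)
print(f"MAPE={error_pct:.1f}% margin={margin:.0%}")
```

Reporting the two columns side by side shows directly whether the prediction error for a given model is smaller than the slack left before a QoE violation.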

Circularity Check

0 steps flagged

No circularity; empirical predictors presented without self-referential reduction


The provided abstract and context describe a systems framework that fits disaggregated predictors on model structure to estimate throughput/power for unseen LLMs, then uses those estimates for configuration selection. No equations, self-definitions, or load-bearing self-citations are shown that would make any 'prediction' equivalent to its training inputs by construction. The central energy-saving claims rest on reported empirical results across devices rather than a closed derivation chain, so the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract only; no free parameters, axioms, or invented entities are described.


Reference graph

Works this paper leans on

77 extracted references · 12 canonical work pages · 5 internal anchors

[1]

Abdelhafez, Karthik Pattabiraman, and Matei Ripeanu

Amirhossein Ahmadi, Hazem A. Abdelhafez, Karthik Pattabiraman, and Matei Ripeanu. 2023. EdgeEngine: A Thermal-Aware Optimization Framework for Edge Inference. In2023 IEEE/ACM Symposium on Edge Computing (SEC). 67–79

2023
[2]

Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. Gqa: Training generalized multi-query transformer models from multi-head checkpoints.arXiv (2023)

2023
[3]

Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar. 2024. LLM in a flash: Efficient Large Language Model Inference with Limited Memory. arXiv:2312.11514 [cs.CL]

work page arXiv 2024
[4]

Apple. 2024. Introducing Apple’s On-Device and Server Foundation Models.https://machinelearning.apple.com/research/introducing- apple-foundation-models. Accessed 7 Feb 2025

2024
[5]

Mauricio Fadel Argerich and Marta Patiño-Martínez. 2024. Measur- ing and Improving the Energy Efficiency of Large Language Models Inference.IEEE Access12 (2024), 80194–80207

2024
[6]

Mariette Awad, Rahul Khanna, Mariette Awad, and Rahul Khanna
[7]

Support vector regression.Efficient learning machines: Theories, concepts, and applications for engineers and system designers(2015), 67–80

2015
[8]

Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. [n. d.]. Open LLM Leaderboard.https://huggingfac e.co/spaces/HuggingFaceH4/open_llm_leaderboard. Accessed 11 Apr 2025

2025
[9]

Mulugeta K Berhe. 2007. Ergonomic temperature limits for hand- held electronic devices. InInternational Electronic Packaging Technical Conference and Exhibition, Vol. 42789. 1041–1047

2007
[10]

Leo Breiman. 2001. Random forests.Machine learning45 (2001), 5–32

2001
[11]

Marc Brysbaert. 2019. How many words do we read per minute? A review and meta-analysis of reading rate.Journal of Memory and Language109 (2019), 104047

2019
[12]

Le Chen, Dahu Feng, Erhu Feng, Rong Zhao, Yingrui Wang, Yubin Xia, Haibo Chen, and Pinjie Xu. 2025. HeteroLLM: Accelerating Large Lan- guage Model Inference on Mobile SoCs platform with Heterogeneous AI Accelerators. arXiv:2501.14794 [cs.DC]

work page arXiv 2025
[13]

Marcus Chow and Daniel Wong. 2023. CoFRIS: Coordinated frequency and resource scaling for GPU inference servers. InProceedings of the 14th International Green and Sustainable Computing Conference. 45–51

2023
[14]

Lucian Codrescu, Willie Anderson, Suresh Venkumanhanti, Mao Zeng, Erich Plondke, Chris Koob, Ajay Ingle, Charles Tabony, and Rick Maule
[15]

Hexagon DSP: An architecture optimized for mobile multimedia and communications.IEEE Micro34, 2 (2014), 34–43

2014
[16]

Benj Edwards. 2024. Exponential growth brews 1 million AI models on Hugging Face.https://arstechnica.com/information-technology /2024/09/ai-hosting-platform-surpasses-1-million-models-for-the- first-time/. Accessed 7 Feb 2025

2024
[17]

FNIRSI. 2025. FNB58 USB Fast Charge Tester.https://www.fnirsi.com /products/fnb58. Accessed 25 Apr 2025

2025
[18]

Ricardo Gonzalez, Benjamin M Gordon, and Mark A Horowitz. 1997. Supply and threshold voltage scaling for low power CMOS.IEEE Journal of Solid-State Circuits32, 8 (1997), 1210–1216

1997
[19]

Google. 2025. Chat with Gemini to supercharge your creativity and productivity.https://store.google.com/intl/en/ideas/categories/ai/. Accessed 7 Feb 2025

2025
[20]

Google. 2025. Thermal mitigation.https://source.android.com/docs/ core/power/thermal-mitigation. Accessed 15 May 2025

2025
[21]

Joseph L Greathouse and Gabriel H Loh. 2018. Machine learning for performance and power modeling of heterogeneous systems. In Proceedings of the International Conference on Computer-Aided Design. 1–6

2018
[22]

Ling Huang, Jinzhu Jia, Bin Yu, Byung-Gon Chun, Petros Maniatis, and Mayur Naik. 2010. Predicting execution time of computer programs using sparse polynomial regression. InAdvances in neural information processing systems (NeurIPS). 883–891

2010
[23]

Christian Janiesch, Patrick Zschech, and Kai Heinrich. 2021. Machine learning and deep learning.Electronic markets31, 3 (2021), 685–695

2021
[24]

Mojan Javaheripi, Sébastien Bubeck, Marah Abdin, Jyoti Aneja, Se- bastien Bubeck, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, et al. 2023. Phi-2: The sur- prising power of small language models.Microsoft Research Blog1, 3 (2023), 3

2023
[25]

JEDEC. 2023. LOW POWER DOUBLE DATA RATE (LPDDR) 5/5X. https://www.jedec.org/standards- documents/docs/jesd209- 5c. Accessed 25 Apr 2025

2023
[26]

Andreas Kosmas Kakolyris, Dimosthenis Masouros, Sotirios Xydis, and Dimitrios Soudris. 2024. SLO-Aware GPU DVFS for Energy-Efficient LLM Inference Serving.IEEE Computer Architecture Letters23, 2 (July 2024), 150–153

2024
[27]

M. G. KENDALL. 1938. A NEW MEASURE OF RANK CORRELATION. Biometrika30, 1-2 (06 1938), 81–93

1938
[28]

Seyeon Kim, Kyungmin Bin, Sangtae Ha, Kyunghan Lee, and Song Chong. 2022. zTT: Learning-Based DVFS with Zero Thermal Throt- tling for Mobile Devices.GetMobile: Mobile Comp. and Comm.25, 4 (March 2022), 30–34

2022
[29]

Stefanos Laskaridis, Kleomenis Katevas, Lorenzo Minto, and Hamed Haddadi. 2024. MELTing Point: Mobile Evaluation of Language Trans- formers. InProceedings of the 30th Annual International Conference on Mobile Computing and Networking (MobiCom). 890–907

2024
[30]

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2021. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. InInternational Conference on Learning Representations

2021
[31]

Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast Inference from Transformers via Speculative Decoding. InProceedings of the 40th ICML (Proceedings of Machine Learning Research, Vol. 202). PMLR, 19274–19286

2023
[32]

Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guo- hong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, and Yunxin Liu. 2024. Personal LLM Agents: Insights and Survey about the C...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Chengdong Lin, Kun Wang, Zhenjiang Li, and Yu Pu. 2023. A Workload- Aware DVFS Robust to Concurrent Tasks for Mobile Devices. InAn- nual International Conference on Mobile Computing and Networking (MobiCom). Article 19, 16 pages

2023
[34]

Jiachen Liu, Jae-Won Chung, Zhiyu Wu, Fan Lai, Myungjin Lee, and Mosharaf Chowdhury. 2024. Andes: Defining and Enhanc- ing Quality-of-Experience in LLM-Based Text Streaming Services. arXiv:2404.16283 [cs.DC]

work page arXiv 2024
[35]

Meta Llama Team. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI]

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Macken, M

P. Macken, M. Degrauwe, M. Van Paemel, and H. Oguey. 1990. A voltage reduction technique for digital systems. InIEEE International Conference on Solid-State Circuits. 238–239. 13

1990
[37]

Paul Joe Maliakel, Shashikant Ilager, and Ivona Brandic. 2025. Investi- gating Energy Efficiency and Performance Trade-offs in LLM Inference Across Tasks and DVFS Settings. arXiv:2501.08219 [cs.LG]

work page arXiv 2025
[38]

2023-2025.MLC-LLM

MLC team. 2023-2025.MLC-LLM

2023
[39]

Dipayan Mukherjee, Sam Hachem, Jeremy Bao, Curtis Madsen, Tian Ma, Saugata Ghose, and Gul Agha. 2025. CRAVE: Analyzing Cross- Resource Interaction to Improve Energy Efficiency in Systems-on- Chip. InProceedings of the Twentieth European Conference on Computer Systems (EuroSys ’25). 59–75

2025
[40]

Yang Ni, Yeseong Kim, Tajana Rosing, and Mohsen Imani. 2022. Online performance and power prediction for edge TPU via comprehensive characterization. In2022 Design, Automation & Test in Europe Confer- ence & Exhibition (DATE). IEEE, 612–615

2022
[41]

Harbin Institute of Technology and iFLYTEK Joint Laboratory (HFL)
[42]

https://huggingface.co/hfl/chinese-llama-2-1.3b

Chinese-LLaMA-2-1.3B: A Chinese-Enhanced LLaMA-2 Model. https://huggingface.co/hfl/chinese-llama-2-1.3b. Accessed 19 Aug 2025

2025
[43]

Ollama. 2025. Ollama: Chat & build with open models.https://ollama .com/. Accessed 15 May 2025

2025
[44]

OPPO. 2024. OPPO Find X8 Series to Debut MediaTek Dimensity 9400 SOC for Global Markets Combining Ultra Performance, Efficiency & AI Experiences. Accessed 25 Apr 2025

2024
[45]

Eva Ostertagová. 2012. Modelling using Polynomial Regression.Pro- cedia Engineering48 (2012), 500–506. Modelling of Mechanical and Mechatronics Systems

2012
[46]

Charlie Hu, Ming Zhang, Paramvir Bahl, and Yi- Min Wang

Abhinav Pathak, Y. Charlie Hu, Ming Zhang, Paramvir Bahl, and Yi- Min Wang. 2011. Fine-grained power modeling for smartphones using system call tracing. InProceedings of the Sixth Conference on Computer Systems(Salzburg, Austria)(EuroSys ’11). Association for Computing Machinery, New York, NY, USA, 153–168

2011
[47]

Orange Pi. 2025. Orange Pi 5 Pro.http://www.orangepi.org/. Accessed 25 Apr 2025

2025
[48]

Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew Kalbarczyk, Tamer Başar, and Ravishankar K. Iyer. 2024. Power-aware Deep Learning Model Serving with 𝜇-Serve. In2024 USENIX Annual Technical Conference (USENIX ATC 24). USENIX Association, Santa Clara, CA, 75–93

2024
[49]

Kalbarczyk, Tamer Başar, and Ravishankar K

Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Saurabh Jha, Chen Wang, Hubertus Franke, Zbigniew T. Kalbarczyk, Tamer Başar, and Ravishankar K. Iyer. 2024. Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction. InThe 5th Interna- tional Workshop on Cloud Intelligence / AIOps at ASPLOS 2024, Vol. 5. 1–7

2024
[50]

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners.OpenAI blog1, 8 (2019), 9

2019
[51]

Rafael J. Wysocki. 2017. intel pstate CPU Performance Scaling Driver. https://www.kernel.org/doc/html/latest/admin-guide/pm/intel_pst ate.html. Accessed 15 May 2025

2017
[52]

Rockchip. 2025. RKLLM Project.https://github.com/airockchip/rknn- llm. Accessed 25 Apr 2025

2025
[53]

Siddharth Samsi, Dan Zhao, Joseph McDonald, Baolin Li, Adam Michaleas, Michael Jones, William Bergeron, Jeremy Kepner, Devesh Tiwari, and Vijay Gadepally. 2023. From Words to Watts: Benchmark- ing the Energy Costs of Large Language Model Inference. In2023 IEEE High Performance Extreme Computing Conference (HPEC). 1–9

2023
[54]

Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. 2024. PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU. In ACM SIGOPS 30th Symposium on Operating Systems Principles (SOSP ’24). 590–606

2024
[55]

Jovan Stojkovic, Esha Choukse, Chaojie Zhang, Inigo Goiri, and Josep Torrellas. 2024. Towards Greener LLMs: Bringing Energy-Efficiency to the Forefront of LLM Inference. arXiv:2403.20306 [cs.AI]

work page arXiv 2024
[56]

Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, and Esha Choukse. 2025. DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency. InIEEE International Symposium on High Performance Computer Architecture (HPCA)

2025
[57]

Tianxiang Tan and Guohong Cao. 2024. Thermal-aware scheduling for deep learning on mobile devices with NPU.IEEE Transactions on Mobile Computing(2024)

2024
[58]

Zhenheng Tang, Yuxin Wang, Qiang Wang, and Xiaowen Chu. 2019. The Impact of GPU DVFS on the Energy and Performance of Deep Learning: an Empirical Study. InProceedings of the Tenth ACM Inter- national Conference on Future Energy Systems. 315–325

2019
[59]

Gemma Team. 2024. Gemma 2: Improving Open Language Models at a Practical Size. arXiv:2408.00118 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2024
[60]

The Linux Kernel Community. 2024. Power Management Quality of Service (PM QoS) Interface.https://www.kernel.org/doc/html/latest/p ower/pm_qos_interface.html. Accessed 15 May 2025

2024
[61]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Alma- hairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Har...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[62]

Susanne Trauzettel-Klosinski, Klaus Dietz, and IReST Study Group
[63]

Standardized assessment of reading performance: The new international reading speed texts IReST.Investigative ophthalmology & visual science53, 9 (2012), 5452–5461

2012
[64]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. At- tention is all you need.Advances in neural information processing systems (NeurIPS)30 (2017)

2017
[65]

Li Wang. 2021. British English-speaking speed 2020.Acad. J. Humanit. Soc. Sci4 (2021), 93–100

2021
[66]

Yunhe Wang, Hanting Chen, Yehui Tang, Tianyu Guo, Kai Han, Ying Nie, Xutao Wang, Hailin Hu, Zheyuan Bai, Yun Wang, Fangcheng Liu, Zhicheng Liu, Jianyuan Guo, Sinan Zeng, Yinchen Zhang, Qinghua Xu, Qun Liu, Jun Yao, Chao Xu, and Dacheng Tao. 2023. PanGu-𝜋: Enhanc- ing Language Model Architectures via Nonlinearity Compensation. arXiv:2312.17276 [cs.CL]

work page arXiv 2023
[67]

Zibo Wang, Yijia Zhang, Fuchun Wei, Bingqiang Wang, Yanlin Liu, Zhiheng Hu, Jingyi Zhang, Xiaoxin Xu, Jian He, Xiaoliang Wang, Wanchun Dou, Guihai Chen, and Chen Tian. 2025. Using Analytical Performance/Power Model and Fine-Grained DVFS to Enhance AI Accelerator Energy Efficiency. In ACM International Conference on Architectural Support for Programming L...

[68]

Rafael J. Wysocki. 2017. CPU Performance Scaling. https://docs.kernel.org/admin-guide/pm/cpufreq.html. Accessed 25 Apr 2025

[69]

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. In International Conference on Machine Learning (ICML). 38087–38099

[70]

Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Gang Huang, Mengwei Xu, and Xuanzhe Liu. 2025. Fast On-device LLM Inference with NPUs. In ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’25). 445–462

[71]

Zhenliang Xue, Yixin Song, Zeyu Mi, Xinrui Zheng, Yubin Xia, and Haibo Chen. 2024. PowerInfer-2: Fast Large Language Model Inference on a Smartphone. arXiv:2406.06282 [cs.LG]

[72]

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang,...

[73]

Jinliang Yuan, Chen Yang, Dongqi Cai, Shihe Wang, Xin Yuan, Zeling Zhang, Xiang Li, Dingge Zhang, Hanzi Mei, Xianqing Jia, Shangguang Wang, and Mengwei Xu. 2024. Mobile Foundation Model as Firmware. In Annual International Conference on Mobile Computing and Networking (MobiCom). 279–295

[74]

Wanghong Yuan and Klara Nahrstedt. 2003. Energy-efficient soft real-time CPU scheduling for mobile multimedia systems. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (Bolton Landing, NY, USA) (SOSP ’03). Association for Computing Machinery, New York, NY, USA, 149–163

[75]

Sangwoon Yun and Kyungtae Kang. 2023. Runtime WCET Estimation Using Machine Learning. In Annual International Conference on Mobile Computing and Networking (MobiCom). 1–3

[76]

Zongpu Zhang, Pranab Dash, Qiang Xu, Y. Charlie Hu, Jian Li, and Haibing Guan. 2026. Rethinking DVFS for Mobile LLMs: Unified Energy-Aware Scheduling with CORE. In MLSys. https://openreview.net/forum?id=PSyHQ8kVUT

[77]

Bohua Zou, Binqi Sun, Yigong Hu, Tomasz Kloda, Marco Caccamo, and Tarek Abdelzaher. 2024. A Performance Prediction-based DNN Partitioner for Edge TPU Pipelining. In IEEE Military Communications Conference (MILCOM). 1–6
