arXiv: 2603.26603 · v2 · pith:VBBAFXNAnew · submitted 2026-03-27 · 💻 cs.SE · cs.AI · cs.LG

Sustainability Is Not Linear: Quantifying Performance, Energy, and Privacy Trade-offs in On-Device Intelligence

Pith reviewed 2026-05-21 09:23 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.LG
keywords on-device LLMs · energy consumption · quantization · mobile devices · trade-offs · Mixture-of-Experts · performance profiling · sustainable AI

The pith

Quantization reduces memory for on-device LLMs but yields negligible energy savings, making architecture the key to battery life.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper builds a replicable pipeline to measure energy use, latency, and output quality for LLMs running on a real Android phone without root access. It reports that importance-aware quantization shrinks memory needs enough to run bigger models yet delivers almost no energy reduction compared with standard mixed-precision approaches. The results also show that Mixture-of-Experts models have the storage footprint of a 7B model while drawing power like a 1B or 2B model. These patterns point to mid-sized models such as Qwen2.5-3B as a practical balance between response quality and sustainable power draw. A reader would care because moving language models to phones promises privacy and offline access, yet battery limits remain the binding constraint.

Core claim

The authors constructed a replicable experimental pipeline to profile the interplay between energy consumption, latency, and generation quality of LLMs on a flagship Android device. They uncovered a quantization energy paradox in which importance-aware quantization reduces memory footprints to fit larger models into RAM but yields negligible energy savings compared to standard mixed-precision methods. This establishes that model architecture, rather than quantization scheme, is the decisive factor for battery life. Mixture-of-Experts architectures have the storage footprint of 7B models yet maintain the lower energy profile of 1B to 2B models. Mid-sized models such as Qwen2.5-3B balance response quality with a

What carries the argument

The quantization energy paradox, which shows that importance-aware quantization fits larger models into RAM but saves little energy compared to mixed-precision methods and therefore makes architecture the controlling factor for power use.
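
The memory side of the paradox can be illustrated with simple arithmetic: weight storage scales with parameter count times bits per weight. The sketch below is not taken from the paper. The bits-per-weight values are approximate averages assumed for illustration, and real quantized files mix tensor types, so actual sizes will differ.

```python
# Rough weight-memory estimate: parameters x bits-per-weight / 8 bytes.
# The bits-per-weight figures are approximate, assumed values for
# illustration only; they are not measurements from the paper.
APPROX_BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q4_K_M": 4.85,  # assumed approximate average
    "IQ4_XS": 4.25,  # assumed approximate average
}


def weight_footprint_gib(n_params: float, scheme: str) -> float:
    """Return the approximate weight storage in GiB for a given scheme."""
    total_bits = n_params * APPROX_BITS_PER_WEIGHT[scheme]
    return total_bits / 8 / 2**30


if __name__ == "__main__":
    for scheme in APPROX_BITS_PER_WEIGHT:
        size = weight_footprint_gib(7e9, scheme)
        print(f"7B parameters @ {scheme}: {size:.2f} GiB")
```

Under these assumed values the importance-aware scheme saves a little memory relative to the mixed-precision one, which is consistent with the claim that its benefit is fitting larger models into RAM rather than reducing energy.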

If this is right

  • For battery-limited phones, selecting models with efficient architectures such as Mixture-of-Experts permits larger capacity without proportional increases in energy cost.
  • Developers can rely on standard mixed-precision quantization rather than more complex importance-aware methods without losing battery performance.
  • Mid-sized models provide the clearest practical compromise among quality, energy draw, and resource use under real device constraints.
  • On-device deployment for privacy and offline use becomes more feasible once the right model size and architecture are chosen.
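
One way to see why a Mixture-of-Experts model might store like a large model but compute like a small one is to separate total parameters from the parameters active per token. The sketch below is an illustration under stated assumptions: the parameter counts are nominal values read off model names, `ModelShape` and `compute_ratio` are hypothetical helpers, and the 2-FLOPs-per-active-weight rule is a common rule of thumb rather than the paper's method. Per-token compute is also only a proxy for energy, since memory traffic matters too.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    name: str
    total_params: float   # all weights that must be stored
    active_params: float  # weights used for each generated token

    def flops_per_token(self) -> float:
        # Rule of thumb: about 2 FLOPs per active weight per token.
        return 2.0 * self.active_params


# Nominal, illustrative shapes; not measured figures from the paper.
dense_7b = ModelShape("dense-7B (hypothetical)", 7e9, 7e9)
moe_1b_7b = ModelShape("MoE-1B-7B (nominal)", 7e9, 1e9)


def compute_ratio(a: ModelShape, b: ModelShape) -> float:
    """How many times more per-token compute model a needs than model b."""
    return a.flops_per_token() / b.flops_per_token()


if __name__ == "__main__":
    print("same storage:", dense_7b.total_params == moe_1b_7b.total_params)
    print("dense / MoE compute ratio:", compute_ratio(dense_7b, moe_1b_7b))
```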

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers of future edge LLMs should target architectural efficiency rather than further quantization refinements to improve real-world sustainability.
  • Benchmarking tools for mobile AI should incorporate architecture-specific energy profiles instead of relying mainly on parameter count or quantization level.
  • Extending the same profiling approach to other hardware platforms could test whether the dominance of architecture over quantization generalizes.

Load-bearing premise

The measurements taken on a single flagship Android device without root access accurately reflect typical user energy consumption and latency without being dominated by thermal throttling or background processes.
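
The premise rests on turning user-space samples into energy figures. The Figure 4 caption mentions a monitoring app that logs voltage; a minimal sketch of how voltage and current samples could be converted into joules and energy per token follows. The sample layout, the function names, and the assumption that readings are already converted to volts and amperes are all hypothetical, and the paper's actual pipeline may differ.

```python
from typing import Sequence, Tuple

# (timestamp in seconds, voltage in volts, current in amperes).
# Assumed layout; raw platform readings would need unit conversion first.
Sample = Tuple[float, float, float]


def energy_joules(samples: Sequence[Sample]) -> float:
    """Trapezoidal integration of instantaneous power P = V * I over time."""
    total = 0.0
    for (t0, v0, i0), (t1, v1, i1) in zip(samples, samples[1:]):
        if t1 < t0:
            raise ValueError("samples must be ordered by time")
        total += 0.5 * (v0 * i0 + v1 * i1) * (t1 - t0)
    return total


def energy_per_token(samples: Sequence[Sample], n_tokens: int) -> float:
    """Average energy per generated token for one run."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return energy_joules(samples) / n_tokens


if __name__ == "__main__":
    # Constant 4 V at 0.5 A for 2 s, invented for illustration.
    run = [(0.0, 4.0, 0.5), (1.0, 4.0, 0.5), (2.0, 4.0, 0.5)]
    print(energy_joules(run), energy_per_token(run, 8))
```

Coarse sampling intervals make this estimate sensitive to short power spikes, which is one reason the premise above carries so much weight.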

What would settle it

Repeating the same model runs on additional devices or with root-level power tracing and observing large energy reductions from importance-aware quantization would falsify the claim that architecture alone determines battery life.

Figures

Figures reproduced from arXiv: 2603.26603 by Eziyo Ehsani, Ivano Malavolta, Luca Giamattei, Roberto Pietrantuono.

Figure 1. Q4_K_M: block-wise mixed precision with higher precision for more sensitive components.
Figure 2. IQ4_XS: importance-aware 4-bit quantization using calibration data and codebook-based reconstruction.
Figure 3. Controlled execution pipeline. Setup is executed once; measurement and …
Figure 4. Measurement workflow. A user-space monitoring app logs voltage, …
Figure 5. Throughput across models and quantization schemes. Prefill benefits from parallel processing of the prompt, whereas generation is slower due to …
Figure 6. Inference latency breakdown. [Plot labels: models Qwen2-0.5B, Qwen2.5-1.5B, Phi-2, Qwen2.5-3B, OLMoE-1B-7B-0125, Qwen2.5-7B, Llama3.1-8B, Gemma2-9B; axes Time to First Token (s) and Energy per Token (Joules); series Q4_K_M and IQ4_XS for both time and energy]
Figure 7. Comparison of Time to First Token (Bars, Left Axis) and Energy per …
Figure 8. Multi-objective trade-offs across models and quantization schemes.
Figure 9. Total energy per run across models and quantization schemes. Distributions summarize 30 repetitions per configuration.
Abstract

The migration of Large Language Models (LLMs) from cloud clusters to edge devices promises enhanced privacy and offline accessibility, but this transition encounters a harsh reality: the physical constraints of mobile batteries, thermal limits, and, most importantly, memory constraints. To navigate this landscape, we constructed a replicable and reproducible experimental pipeline to profile the complex interplay between energy consumption, latency, and quality of LLMs on mobile devices. We harness this pipeline to conduct an empirical case study on a flagship Android device, capturing granular metrics across eight LLMs ranging from 0.5B to 9B parameters without requiring root access, ensuring our findings reflect realistic user conditions. The findings highlight the trade-offs between generation quality, performance, power and resource consumption, revealing which LLMs offer the best balance across metrics and under different conditions. Besides, we uncovered a counter-intuitive quantization energy paradox: while modern importance-aware quantization successfully reduces memory footprints to fit larger models into RAM, we found it yields negligible energy savings compared to standard mixed-precision methods. This proves that for battery life, the architecture of the model, not its quantization scheme, is the decisive factor. We further identified that Mixture-of-Experts (MoE) architectures defy the standard size-energy trend, offering the storage capacity of a 7B model while maintaining the lower energy profile of a 1B to 2B model. Finally, an analysis of these multi-objective trade-offs reveals a pragmatic sweet spot of mid-sized models, such as Qwen2.5-3B, that effectively balance response quality with sustainable energy consumption.
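
The multi-objective "sweet spot" idea in the abstract can be made concrete with a Pareto filter: a configuration is kept only if no other configuration is at least as good on every metric and strictly better on one. The sketch below is illustrative only. The `Profile` fields, the model labels, and every number are invented placeholders, not results from the paper, and the paper's own analysis may use a different procedure.

```python
from typing import List, NamedTuple


class Profile(NamedTuple):
    name: str
    quality: float             # higher is better
    energy_per_token_j: float  # lower is better
    latency_s: float           # lower is better


def dominates(a: Profile, b: Profile) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere."""
    no_worse = (
        a.quality >= b.quality
        and a.energy_per_token_j <= b.energy_per_token_j
        and a.latency_s <= b.latency_s
    )
    strictly_better = (
        a.quality > b.quality
        or a.energy_per_token_j < b.energy_per_token_j
        or a.latency_s < b.latency_s
    )
    return no_worse and strictly_better


def pareto_front(profiles: List[Profile]) -> List[Profile]:
    """Keep only the profiles that no other profile dominates."""
    return [p for p in profiles if not any(dominates(q, p) for q in profiles)]


# Placeholder values invented for illustration.
EXAMPLE = [
    Profile("small", 0.50, 0.3, 0.5),
    Profile("mid", 0.75, 0.9, 1.5),
    Profile("large", 0.80, 3.0, 6.0),
    Profile("dominated", 0.70, 1.2, 2.0),
]

if __name__ == "__main__":
    print([p.name for p in pareto_front(EXAMPLE)])
```

A Pareto front only removes clearly inferior options; choosing one point on it, such as a mid-sized model, still depends on how quality is weighed against energy and latency.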

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript presents a replicable experimental pipeline for profiling energy consumption, latency, generation quality, and resource use of LLMs on a non-rooted flagship Android device. In a case study with eight models (0.5B–9B parameters), it reports multi-objective trade-offs and identifies a quantization energy paradox: importance-aware quantization reduces memory footprints but yields negligible energy savings relative to standard mixed-precision methods, leading to the conclusion that model architecture—not quantization scheme—is the decisive factor for battery life. The work also notes that Mixture-of-Experts architectures maintain low energy profiles despite larger storage requirements and identifies mid-sized models (e.g., Qwen2.5-3B) as pragmatic sweet spots balancing quality and sustainability.

Significance. If the measurements prove robust, the paper supplies valuable real-device data that challenges common assumptions about quantization benefits for energy efficiency in on-device LLMs. The replicable pipeline without root access and the concrete multi-metric findings constitute clear strengths that support reproducibility and practical guidance. The quantization energy paradox and the efficiency observations for MoE models could usefully inform architecture choices for sustainable edge deployment.

major comments (2)
  1. §3 (Experimental Pipeline) and the quantization results: The central paradox claim, that importance-aware quantization produces negligible energy savings versus mixed-precision, rests on energy deltas measured via public non-rooted Android APIs. These readings are susceptible to thermal throttling, background processes, and frequency scaling; without reported per-trial variance, error bars, or statistical tests comparing the small deltas, it is unclear whether the observed differences exceed measurement noise and can support the strong conclusion that architecture alone is decisive.
  2. Results on MoE models: The claim that MoE architectures combine 7B-scale storage with 1B–2B energy profiles requires explicit controls or ablations showing that the energy savings arise from sparse activation rather than other model-specific factors (e.g., layer widths or token throughput). Absent such detail, the deviation from the standard size-energy trend remains suggestive rather than conclusive.
minor comments (3)
  1. Abstract: The sentence 'This proves that...' overstates an empirical observation; rephrase to 'suggests' or 'indicates' to reflect the measurement-based nature of the finding.
  2. Figures and tables: Ensure all energy and latency plots include units, error bars where available, and legends that distinguish quantization variants clearly.
  3. Related work: Add citations to prior mobile LLM energy studies that used comparable Android APIs or external metering for context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help strengthen the presentation of our empirical findings. We address each major comment below and indicate the corresponding revisions to the manuscript.

Point-by-point responses
  1. Referee: §3 (Experimental Pipeline) and the quantization results: The central paradox claim, that importance-aware quantization produces negligible energy savings versus mixed-precision, rests on energy deltas measured via public non-rooted Android APIs. These readings are susceptible to thermal throttling, background processes, and frequency scaling; without reported per-trial variance, error bars, or statistical tests comparing the small deltas, it is unclear whether the observed differences exceed measurement noise and can support the strong conclusion that architecture alone is decisive.

    Authors: We agree that explicit reporting of measurement variability is essential for supporting claims about small energy deltas. Our experimental protocol included repeated trials under controlled conditions to reduce the impact of background processes and thermal effects, but we did not include per-trial variance or formal statistical comparisons in the original submission. In the revised manuscript we will add error bars (standard deviation across runs) to all energy and latency plots and include paired statistical tests to establish whether the observed differences between quantization schemes exceed measurement noise. These additions will provide clearer support for the conclusion that model architecture is the dominant factor.

    Revision: yes

  2. Referee: Results on MoE models: The claim that MoE architectures combine 7B-scale storage with 1B–2B energy profiles requires explicit controls or ablations showing that the energy savings arise from sparse activation rather than other model-specific factors (e.g., layer widths or token throughput). Absent such detail, the deviation from the standard size-energy trend remains suggestive rather than conclusive.

    Authors: We appreciate the call for greater isolation of the sparsity effect. Our results derive from head-to-head profiling of multiple models, including MoE variants, on identical hardware and workloads; the lower energy draw of the MoE models is consistent with their known sparse activation pattern. Dedicated ablations that hold all other architectural variables constant are not feasible with the publicly available models we evaluated. In the revision we will expand the discussion section to explicitly list potential confounding factors (layer widths, token throughput) and qualify the MoE observation as a comparative finding rather than a causal claim, while retaining the empirical trend as a useful practical signal for practitioners.

    Revision: partial
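
The error bars and paired tests promised in the first response could be computed along the following lines. This is a minimal standard-library sketch, not the authors' analysis: it assumes runs of the two quantization schemes can be meaningfully paired (for example by repetition index), the helper name `paired_summary` is hypothetical, and it reports only the t statistic. A p-value would additionally require the t distribution with n − 1 degrees of freedom from a statistics library.

```python
import math
import statistics
from typing import Dict, Sequence


def paired_summary(a: Sequence[float], b: Sequence[float]) -> Dict[str, float]:
    """Summarize paired energy readings a[i] vs b[i] (e.g. joules per run)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 runs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    if sd_diff > 0:
        t_stat = mean_diff / (sd_diff / math.sqrt(n))
    elif mean_diff == 0:
        t_stat = 0.0
    else:
        t_stat = math.copysign(math.inf, mean_diff)
    return {
        "mean_a": statistics.mean(a),
        "sd_a": statistics.stdev(a),  # error-bar half-width for scheme a
        "mean_b": statistics.mean(b),
        "sd_b": statistics.stdev(b),  # error-bar half-width for scheme b
        "mean_diff": mean_diff,
        "t_stat": t_stat,
        "dof": float(n - 1),
    }


if __name__ == "__main__":
    # Invented readings for illustration only.
    scheme_a = [10.0, 12.0, 11.0, 13.0]
    scheme_b = [9.0, 10.0, 10.0, 11.0]
    print(paired_summary(scheme_a, scheme_b))
```

If the runs are not naturally paired, an unpaired or non-parametric test would be the more defensible choice.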

Circularity Check

0 steps flagged

No circularity: empirical measurements and observations stand independently of any fitted derivations.

full rationale

The paper describes construction of an experimental pipeline for direct profiling of energy, latency, quality, and memory on a non-rooted Android device across eight LLMs. All central claims, including the quantization energy paradox and the conclusion that model architecture dominates battery life, are presented as outcomes of these replicable measurements rather than predictions derived from equations, parameters fitted to the same data, or self-cited uniqueness theorems. No load-bearing step reduces by construction to its own inputs; the work reports observed trade-offs under stated conditions without renaming known results or smuggling ansatzes via prior citations. This is the expected finding for a measurement-driven empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on empirical measurements rather than mathematical derivations; no free parameters are fitted to produce the paradox, and no new entities are postulated.

axioms (1)
  • domain assumption: Measurements taken without root access on a single flagship Android device accurately reflect typical user energy and latency under realistic conditions.
    Invoked when the authors state that findings reflect realistic user conditions without root.

pith-pipeline@v0.9.0 · 5843 in / 1253 out tokens · 48622 ms · 2026-05-21T09:23:11.417204+00:00 · methodology


Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 11 internal anchors

  1. [1]

    Edge computing: Vision and challenges,

    W. Shi, J. Cao, Q. Zhang, Y . Li, and L. Xu, “Edge computing: Vision and challenges,”IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646, 2016

  2. [2]

    Sustainable AI: Environmental implications, challenges and opportunities,

    C.-J. Wu, R. Raghavendra, U. Gupta, B. Acun, N. Ardalani, K. Maeng, G. Chang, F. Aga, J. Huang, C. Baiet al., “Sustainable AI: Environmental implications, challenges and opportunities,”Proceedings of Machine Learning and Systems, vol. 4, pp. 795–813, 2022

  3. [3]

    Efficiently scaling transformer inference,

    R. Pope, S. Douglas, A. Chowdhery, J. Devlin, J. Bradbury, J. Heek, K. Xiao, S. Agrawal, and J. Dean, “Efficiently scaling transformer inference,”Proceedings of Machine Learning and Systems, vol. 5, 2023

  4. [4]

    DeepSpeed-inference: enabling efficient inference of transformer models at unprecedented scale,

    R. Y . Aminabadi, S. Rajbhandari, A. A. Awan, C. Li, D. Li, E. Zheng, O. Ruwase, S. Smith, M. Zhang, J. Fanget al., “DeepSpeed-inference: enabling efficient inference of transformer models at unprecedented scale,” inSC22: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2022, pp. 1–15

  5. [5]

    [Online]

    Monsoon Solutions, Inc.,High V oltage Power Monitor (P/N: AAA10F) User Manual, Monsoon Solutions, Inc., Bellevue, WA, USA, 2024. [Online]. Available: https://www.msoon.com/high-voltage-power-monitor

  6. [6]

    Evaluating the effectiveness of model-based power characterization,

    J. C. McCullough, Y . Agarwal, J. Chandrashekar, S. Kuppuswamy, A. C. Snoeren, and R. K. Gupta, “Evaluating the effectiveness of model-based power characterization,” inProceedings of the 2011 USENIX Annual Technical Conference (USENIX ATC ’11), 2011

  7. [7]

    AppScope: Application energy metering framework for Android smartphones using kernel activity monitoring,

    C. Yoon, D. Kim, W. Jung, C. Kang, and H. Cha, “AppScope: Application energy metering framework for Android smartphones using kernel activity monitoring,” inProceedings of the 2012 USENIX Annual Technical Conference (USENIX ATC ’12), 2012

  8. [8]

    llama.cpp: LLM inference in C/C++,

    G. Gerganov and llama.cpp contributors, “llama.cpp: LLM inference in C/C++,” https://github.com/ggml-org/llama.cpp, 2023

  9. [9]

    BERTScore: Evaluating text generation with BERT,

    T. Zhang, V . Kishore, F. Wu, K. Q. Weinberger, and Y . Artzi, “BERTScore: Evaluating text generation with BERT,” inProceedings of the 8th International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, Apr 2020. [Online]. Available: https://openreview.net/forum?id=SkeHuCVFDr

  10. [10]

    G-Eval: NLG evaluation using GPT-4 with better human alignment,

    Y . Liu, D. Iter, Y . Xu, S. Wang, R. Xu, and C. Zhu, “G-Eval: NLG evaluation using GPT-4 with better human alignment,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). Singapore: Association for Computational Linguistics, Dec 2023, pp. 2511–2522. [Online]. Available: https://aclanthology.org/2023.emnlp-main.153

  11. [11]

    Green AI,

    R. Schwartz, J. Dodge, N. A. Smith, and O. Etzioni, “Green AI,” Communications of the ACM, vol. 63, no. 12, pp. 54–63, 2020

  12. [12]

    Energy and policy consid- erations for deep learning in NLP,

    E. Strubell, A. Ganesh, and A. McCallum, “Energy and policy consid- erations for deep learning in NLP,” inProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 3645–3650

  13. [13]

    The impact of knowledge distillation on the energy consumption and runtime efficiency of nlp models,

    Y . Yuan, J. Zhang, Z. Zhang, K. Chen, J. Shi, V . Stoico, and I. Malavolta, “The impact of knowledge distillation on the energy consumption and runtime efficiency of nlp models,” inProceedings of the 2024 IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI (CAIN ’24). Lisbon, Portugal: ACM, 2024

  14. [14]

    On-device or remote? on the energy efficiency of fetching llm-generated content,

    V . Nguyen, V . Dhopate, H. Huynh, H. Bouhlal, A. Annengala, G. L. Scoccia, M. Martinez, V . Stoico, and I. Malavolta, “On-device or remote? on the energy efficiency of fetching llm-generated content,” inProceedings of the 2025 IEEE/ACM 4th International Conference on AI Engineering - Software Engineering for AI (CAIN ’25). IEEE, 2025, pp. 72–82

  15. [15]

    Sometimes painful but certainly promising: Feasibility and trade-offs of language model inference at the edge,

    M. Abstreiter, “Sometimes painful but certainly promising: Feasibility and trade-offs of language model inference at the edge,” inProceedings of the 4th Workshop on Machine Learning and Systems (EuroMLSys ’24). Athens, Greece: ACM, 2024, pp. 1–8. [Online]. Available: https://doi.org/10.1145/3642970.3655835

  16. [16]

    Smoothquant: Accurate and efficient post-training quantization for large language models,

    G. Xiao, J. Lin, F. Seide, S. Hanet al., “Smoothquant: Accurate and efficient post-training quantization for large language models,” in Proceedings of the 40th International Conference on Machine Learning (ICML), 2023

  17. [17]

    GPTQ: Accurate post-training quantization for generative pre-trained transformers,

    E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “GPTQ: Accurate post-training quantization for generative pre-trained transformers,” in Proceedings of the 11th International Conference on Learning Represen- tations (ICLR), 2023

  18. [18]

    LLM.int8(): 8-bit matrix multiplication for transformers at scale,

    T. Dettmers, M. Lewis, Y . Belkada, and L. Zettlemoyer, “LLM.int8(): 8-bit matrix multiplication for transformers at scale,” inProceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS ’22), vol. 35, 2022, pp. 30 318–30 332

  19. [19]

    AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

    J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, “AWQ: Activation-aware weight quantization for LLM compression and acceleration,” inProceedings of the 7th MLSys Conference (MLSys 2024), 2024, santa Clara, CA. [Online]. Available: https://arxiv.org/abs/2306.00978

  20. [20]

    TVM: An automated end-to-end optimizing compiler for deep learning,

    T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y . Hu, L. Cezeet al., “TVM: An automated end-to-end optimizing compiler for deep learning,” inProceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’18), 2018, pp. 578–594

  21. [21]

    MELTing Point: Mobile evaluation of language transformers,

    S. Laskaridis, K. Katevas, L. Minto, and H. Haddadi, “MELTing Point: Mobile evaluation of language transformers,” inProceedings of the 30th Annual International Conference on Mobile Computing and Networking (MobiCom ’24). Washington D.C., USA: ACM, Nov 2024, pp. 890–907. [Online]. Available: https://doi.org/10.1145/3636534.3690668

  22. [22]

    FlashAttention: Fast and memory-efficient exact attention with IO-awareness,

    T. Dao, D. Y . Fu, S. Ermon, A. Rudra, and C. R ´e, “FlashAttention: Fast and memory-efficient exact attention with IO-awareness,” inProceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS ’22), vol. 35, 2022, pp. 16 344–16 359

  23. [23]

    An analysis of power consumption in a smartphone,

    A. Carroll and G. Heiser, “An analysis of power consumption in a smartphone,” inProceedings of the 2010 USENIX Annual Technical Conference (USENIX ATC ’10), vol. 14, Boston, MA, 2010, pp. 21–21

  24. [24]

    Where is the energy spent inside my app? Fine grained energy accounting on smartphones with Eprof,

    A. Pathak, Y . C. Hu, and M. Zhang, “Where is the energy spent inside my app? Fine grained energy accounting on smartphones with Eprof,” in Proceedings of the 7th ACM European Conference on Computer Systems (EuroSys ’11), 2011, pp. 29–42

  25. [25]

    Batterymanager-companion: Companion app for the bat- terymanager plugin for android-runner,

    S2-group, “Batterymanager-companion: Companion app for the bat- terymanager plugin for android-runner,” https://github.com/S2-group/ batterymanager-companion/, 2024

  26. [26]

    Green mining: investigating power consumption across versions,

    A. Hindle, A. Wilson, K. Rasmussen, E. J. Jedwab, R. Godfrey, and P. Sweeney, “Green mining: investigating power consumption across versions,” inProceedings of the 34th International Conference on Software Engineering (ICSE ’12). IEEE, 2012, pp. 1305–1308

  27. [27]

    A framework for the automatic execution of measurement-based experiments on android devices,

    I. Malavolta, E. M. Grua, C.-Y . Lam, R. de Vries, F. Tan, E. Zielinski, M. Peters, and L. Kaandorp, “A framework for the automatic execution of measurement-based experiments on android devices,” inProceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (ASE ’20). ACM/IEEE, 2020

  28. [28]

    Experiment Runner: A tool for the automatic orchestration of experiments targeting software systems,

    M. Karsten, A. C. Dragomir, R. Apsan, V . Stoico, and I. Malavolta, “Experiment Runner: A tool for the automatic orchestration of experiments targeting software systems,”Science of Computer Programming, vol. 239, p. 103415, Jan 2025

  29. [29]

    Trepn power profiler,

    Qualcomm Technologies, Inc., “Trepn power profiler,” Qualcomm Developer Network, 2024, accessed: 2026-02-12. [Online]. Available: https://developer.qualcomm.com/forums/software/trepn-power-profiler

  30. [30]

    Codecarbon: Estimate and track carbon emissions from machine learning computing,

    V . Schmidt, K. Goyal, A. Joshi, B. Feld, L. Conell, N. Laskaris, D. Blank, J. Wilson, S. Friedler, and S. Luccioni, “Codecarbon: Estimate and track carbon emissions from machine learning computing,” 2021

  31. [31]

    Energibridge: Empowering soft- ware sustainability through cross-platform energy measurement,

    J. Sallou, L. Cruz, and T. Durieux, “Energibridge: Empowering soft- ware sustainability through cross-platform energy measurement,”arXiv preprint arXiv:2312.13897, 2023

  32. [32]

    Open LLM Leaderboard,

    E. Beeching, C. Fourrier, N. Habib, S. Han, N. Lambert, N. Rajani, O. Sanseviero, L. Tunstall, and T. Wolf, “Open LLM Leaderboard,” Hugging Face Space, 2023. [Online]. Available: https://huggingface.co/ spaces/open-llm-leaderboard/open llm leaderboard

  33. [33]

    Training language models to follow instructions with human feedback,

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Training language models to follow instructions with human feedback,” inProceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS ’22), vol. 35, 2022, pp. 27 730–27 744

  34. [34]

    Finetuned language models are zero-shot learners,

    J. Wei, M. Bosma, V . Y . Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V . Le, “Finetuned language models are zero-shot learners,” inProceedings of the 9th International Conference on Learning Representations (ICLR), 2021

  35. [35]

    Qwen2 Technical Report

    A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huanget al., “Qwen2 technical report,” arXiv preprint arXiv:2407.10671, 2024. [Online]. Available: https: //arxiv.org/abs/2407.10671

  36. [36]

    Qwen2.5 Technical Report

    Qwen Team, “Qwen2.5 technical report,”arXiv preprint arXiv:2412.15115, 2024. [Online]. Available: https://arxiv.org/abs/ 2412.15115

  37. [37]

    Phi-2: The surprising power of small language models,

    M. Javaheripi and S. Bubeck, “Phi-2: The surprising power of small language models,” Microsoft Research Blog, Dec 2023. [Online]. Available: https://www.microsoft.com/en-us/research/blog/ phi-2-the-surprising-power-of-small-language-models/

  38. [39]

    OLMoE: Open Mixture-of-Experts Language Models

    [Online]. Available: https://arxiv.org/abs/2409.02060

  39. [40]

    The Llama 3 Herd of Models

    A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letak, A. Mathur, A. Schelten, A. Yang, A. Fanet al., “The Llama 3 herd of models,”arXiv preprint arXiv:2407.21783, 2024. [Online]. Available: https://arxiv.org/abs/2407.21783

  40. [41]

    Gemma 2: Improving Open Language Models at a Practical Size

    G. DeepMind, “Gemma 2: Improving open language models at a practical size,”arXiv preprint arXiv:2408.00118, 2024. [Online]. Available: https://arxiv.org/abs/2408.00118

  41. [42]

    AI benchmark: Running deep neural networks on Android smartphones,

    A. Ignatov, R. Timofte, W. Chou, K. Wang, M. Wu, T. Hartley, and L. Van Gool, “AI benchmark: Running deep neural networks on Android smartphones,” inProceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018, pp. 0–0

  42. [43]

    llama.cpp quantize tool,

    G. Gerganov and llama.cpp contributors, “llama.cpp quantize tool,” 2023, accessed: 2026-02-08. [Online]. Available: https://github.com/ ggml-org/llama.cpp/tree/master/examples/quantize

  43. [44]

    Dettmers, M

    T. Dettmers, M. Lewis, S. Shleifer, and L. Zettlemoyer, “8-bit optimizers via block-wise quantization,” inInternational Conference on Learning Representations, 2022. [Online]. Available: https: //arxiv.org/abs/2110.02861

  44. [45]

    Android Debug Bridge (ADB),

    Google, “Android Debug Bridge (ADB),” 2024, accessed: 2026-02-12. [Online]. Available: https://developer.android.com/tools/adb

  45. [46]

    Understanding the energy consumption of Android app idle states,

    M. A. Hoque, M. Siekkinen, and J. K. Nurminen, “Understanding the energy consumption of Android app idle states,”Pervasive and Mobile Computing, vol. 24, pp. 68–86, 2015

  46. [47]

    Power side-channel attacks on mobile devices: A survey,

    M. Li, Y . Gao, S. F. Al-Sarawi, and D. Abbott, “Power side-channel attacks on mobile devices: A survey,”IEEE Access, vol. 10, pp. 6718– 6736, 2022

  47. [48]

    Android BatteryManager API reference,

    Google, “Android BatteryManager API reference,” 2024, accessed: 2026-02-08. [Online]. Available: https://developer.android.com/reference/ android/os/BatteryManager

  48. [49]

    SummEval: Re-evaluating summarization evaluation,

    A. R. Fabbri, W. Kry ´sci´nski, B. McCann, C. Xiong, R. Socher, and D. Radev, “SummEval: Re-evaluating summarization evaluation,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 391–409, 2021

  49. [50]

    Efficient memory management for large language model serving with PagedAttention,

    W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and C. Re, “Efficient memory management for large language model serving with PagedAttention,” inProceedings of the 29th Symposium on Operating Systems Principles (SOSP ’23), 2023, pp. 611–626

  50. [51]

    Hitting the memory wall: implications of the obvious,

    W. A. Wulf and S. A. McKee, “Hitting the memory wall: implications of the obvious,” ACM SIGARCH Computer Architecture News, vol. 23, no. 1, pp. 20–24, 1995

  51. [52]

    Scaling Laws for Neural Language Models

    J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” arXiv preprint arXiv:2001.08361, 2020

  52. [53]

    TensorFlow Lite Micro: Embedded machine learning for TinyML systems,

    R. David, J. Duke, A. Jain, V. J. Reddi, N. Jeffries, J. Li, N. Krentz, T. Cruesoe, and P. Warden, “TensorFlow Lite Micro: Embedded machine learning for TinyML systems,” Proceedings of Machine Learning and Systems, vol. 3, pp. 800–811, 2021

  53. [54]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” arXiv preprint arXiv:1701.06538, 2017

  54. [55]

    Response time in man-computer conversational transactions,

    R. B. Miller, “Response time in man-computer conversational transactions,” in Proceedings of the fall joint computer conference, part I, 1968, pp. 267–277

  55. [56]

    Usability engineering

    J. Nielsen, Usability engineering. Morgan Kaufmann, 1993

  56. [57]

    Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,

    S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” in Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015

  57. [58]

    Energy-efficient thermal management for multiprocessor systems-on-chip,

    J. Kong, S. W. Chung, and K. Choi, “Energy-efficient thermal management for multiprocessor systems-on-chip,” in Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE ’13). IEEE, 2013, pp. 1119–1124

  58. [59]

    A survey of quantization methods for efficient neural network inference,

    A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, “A survey of quantization methods for efficient neural network inference,” arXiv preprint arXiv:2103.13630, 2021

  59. [60]

    A White Paper on Neural Network Quantization

    M. Nagel, M. Fournarakis, R. A. Amjad, Y. Bondarenko, M. Van Baalen, and T. Blankevoort, “A white paper on neural network quantization,” arXiv preprint arXiv:2106.08295, 2021

  60. [61]

    I-BERT: Integer-only BERT quantization,

    S. Kim, A. Gholami, Z. Yao, M. W. Mahoney, and K. Keutzer, “I-BERT: Integer-only BERT quantization,” in Proceedings of the 38th International Conference on Machine Learning (ICML). PMLR, 2021, pp. 5506–5518

  61. [62]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,

    W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” Journal of Machine Learning Research, vol. 23, no. 120, pp. 1–39, 2022

  62. [63]

    Mixture-of-experts with expert choice routing,

    Y. Zhou, T. Lei, H. Liu, N. Du, Y. Huang, V. Zhao, A. M. Dai, Q. V. Le, J. Laudon et al., “Mixture-of-experts with expert choice routing,” in Proceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS ’22), vol. 35, 2022, pp. 7103–7114

  63. [64]

    MegaBlocks: Efficient sparse training with mixture-of-experts,

    T. Gale, D. Narayanan, C. Young, and M. Zaharia, “MegaBlocks: Efficient sparse training with mixture-of-experts,” Proceedings of Machine Learning and Systems, vol. 5, pp. 288–304, 2023

  64. [65]

    Temperature-aware microarchitecture,

    K. Skadron, M. R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan, “Temperature-aware microarchitecture,” ACM SIGARCH Computer Architecture News, vol. 32, no. 2, pp. 2–13, 2004

  65. [66]

    Power and energy characterization of ARM processors,

    V. Keller, R. Lachaize, V. Gramoli et al., “Power and energy characterization of ARM processors,” in Proceedings of the 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2014, pp. 116–126

  66. [67]

    Dark silicon and the end of multicore scaling,

    H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” IEEE Micro, vol. 32, no. 3, pp. 122–134, 2011

  67. [68]

    Towards sustainable AI: a comprehensive study of carbon footprints in large language models,

    Y. Wang, Y. Li, X. Zheng, and H. Liu, “Towards sustainable AI: a comprehensive study of carbon footprints in large language models,” arXiv preprint arXiv:2310.03093, 2023

  68. [69]

    Understanding and mitigating the security risks of voice-driven interfaces,

    C. Yan, X. Ji, K. Wang, Q. Jiang, Z. Jin, and W. Xu, “Understanding and mitigating the security risks of voice-driven interfaces,” in Proceedings of the 29th USENIX Security Symposium (USENIX Security 20), 2020, pp. 2625–2642

  69. [70]

    Guidelines for conducting and reporting case study research in software engineering,

    P. Runeson and M. Höst, “Guidelines for conducting and reporting case study research in software engineering,” Empirical Software Engineering, vol. 14, no. 2, pp. 131–164, 2009

  70. [71]

    Software wear management for persistent memories,

    V. Gogte, W. Wang, A. Kolli, and T. F. Wenisch, “Software wear management for persistent memories,” in Proceedings of the 17th USENIX Conference on File and Storage Technologies (FAST ’19), 2019, pp. 45–58

  71. [72]

    Fast inference from transformers via speculative decoding,

    Y. Leviathan, M. Kalman, and Y. Matias, “Fast inference from transformers via speculative decoding,” in Proceedings of the 40th International Conference on Machine Learning (ICML). PMLR, 2023, pp. 19 274–19 286

  72. [73]

    Taking AI to the edge: Arm’s new neural processing units,

    S. Cass, “Taking AI to the edge: Arm’s new neural processing units,” IEEE Spectrum, vol. 56, no. 5, pp. 16–17, 2019

  73. [74]

    Lithium-ion battery degradation: what you need to know,

    S. Pelletier, O. Jabali, G. Laporte, and M. Veneroni, “Lithium-ion battery degradation: what you need to know,” Physical Chemistry Chemical Physics, vol. 19, no. 32, pp. 21 231–21 245, 2017

  74. [75]

    MLPerf inference benchmark,

    V. J. Reddi, C. Cheng, D. Kanter, P. Mattson, G. Schmuelling, C.-J. Wu, B. Anderson, M. Maximov, T. Choudhury, D. Gregg et al., “MLPerf inference benchmark,” ACM SIGARCH Computer Architecture News, vol. 48, no. 1, pp. 50–65, 2020

  75. [76]

    Judging LLM-as-a-judge with MT-Bench and chatbot arena,

    L. Zheng, W.-L. Chiang, Y. Sheng, S. Hao, Z. Wu, J. Ba, Z. L. Jiang, Z. Wu, A. Mirza, Z. Li et al., “Judging LLM-as-a-judge with MT-Bench and chatbot arena,” in Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS ’23), vol. 36, 2023

  76. [77]

    Carbon Emissions and Large Neural Network Training

    D. Patterson, J. Gonzalez, Q. Le, C. Liang, L.-M. Munguia, D. Rothchild, D. So, M. Texier, and J. Dean, “Carbon emissions and large neural network training,” arXiv preprint arXiv:2104.10350, 2021

  77. [78]

    FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    L. Chen, M. Zaharia, and J. Zou, “FrugalGPT: How to use large language models while reducing cost and improving performance,” arXiv preprint arXiv:2305.05176, 2023

  78. [79]

    Efficient streaming language models with attention sinks,

    G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis, “Efficient streaming language models with attention sinks,” in Proceedings of the 12th International Conference on Learning Representations (ICLR), 2023