Sustainability Is Not Linear: Quantifying Performance, Energy, and Privacy Trade-offs in On-Device Intelligence

Eziyo Ehsani; Ivano Malavolta; Luca Giamattei; Roberto Pietrantuono

arXiv: 2603.26603 · v2 · pith:VBBAFXNAnew · submitted 2026-03-27 · 💻 cs.SE · cs.AI · cs.LG

Pith reviewed 2026-05-21 09:23 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.LG

keywords on-device LLMs · energy consumption · quantization · mobile devices · trade-offs · Mixture-of-Experts · performance profiling · sustainable AI

The pith

Quantization reduces memory for on-device LLMs but yields negligible energy savings, making architecture the key to battery life.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper builds a replicable pipeline to measure energy use, latency, and output quality for LLMs running on a real Android phone without root access. It reports that importance-aware quantization shrinks memory needs enough to run bigger models, yet delivers almost no energy reduction compared with standard mixed-precision approaches. The results also show that Mixture-of-Experts models have the storage footprint of a 7B model while drawing power like a 1B or 2B model. These patterns point to mid-sized models such as Qwen2.5-3B as a practical balance between response quality and sustainable power draw. A reader would care because moving language models to phones promises privacy and offline access, yet battery limits remain the binding constraint.

Core claim

The authors constructed a replicable experimental pipeline to profile the interplay between energy consumption, latency, and generation quality of LLMs on a flagship Android device. They uncovered a quantization energy paradox in which importance-aware quantization reduces memory footprints to fit larger models into RAM but yields negligible energy savings compared to standard mixed-precision methods. This establishes that model architecture, rather than quantization scheme, is the decisive factor for battery life. Mixture-of-Experts architectures store like 7B models yet maintain the lower energy profile of 1B to 2B models. Mid-sized models such as Qwen2.5-3B balance response quality with a

What carries the argument

The quantization energy paradox, which shows that importance-aware quantization fits larger models into RAM but saves little energy compared to mixed-precision methods and therefore makes architecture the controlling factor for power use.
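
The memory half of the paradox follows from simple arithmetic: weight storage scales with parameter count times bits per weight. The sketch below is a minimal illustration under stated assumptions. The bits-per-weight values and scheme names are hypothetical placeholders, not figures from the paper, and the estimate ignores the KV cache and runtime overhead.

```python
def weight_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB: parameters x bits per weight / 8 bytes."""
    return n_params * bits_per_weight / 8 / 2**30


# Hypothetical bits-per-weight values, chosen only for illustration.
ILLUSTRATIVE_BPW = {"fp16": 16.0, "scheme_a_4bit": 4.5, "scheme_b_4bit": 4.25}


def footprints(n_params: float) -> dict[str, float]:
    """Estimated weight footprint in GiB for each illustrative scheme."""
    return {
        name: weight_footprint_gib(n_params, bpw)
        for name, bpw in ILLUSTRATIVE_BPW.items()
    }


if __name__ == "__main__":
    for size in (3e9, 7e9, 9e9):
        row = {k: round(v, 2) for k, v in footprints(size).items()}
        print(f"{size / 1e9:.0f}B params: {row}")
```

Nothing in this calculation speaks to energy. The paper's observation is that a smaller footprint of this kind did not translate into a comparable energy saving on the device tested.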

If this is right

For battery-limited phones, selecting models with efficient architectures such as Mixture-of-Experts permits larger capacity without proportional increases in energy cost.
Developers can rely on standard mixed-precision quantization rather than more complex importance-aware methods without losing battery performance.
Mid-sized models provide the clearest practical compromise among quality, energy draw, and resource use under real device constraints.
On-device deployment for privacy and offline use becomes more feasible once the right model size and architecture are chosen.
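
One way to read the Mixture-of-Experts point above is that storage tracks total parameters while per-token compute tracks only the parameters that are activated. The sketch below illustrates that distinction. The configuration values (expert count, expert size, routing width) are invented for illustration and are not taken from the paper or from any model's published specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    """Hypothetical Mixture-of-Experts parameter budget."""

    shared_params: float      # always-active parameters (attention, embeddings)
    params_per_expert: float  # parameters in one expert, summed over all layers
    n_experts: int            # experts stored in memory
    top_k: int                # experts routed per token

    def total_params(self) -> float:
        """Parameters that must be stored."""
        return self.shared_params + self.n_experts * self.params_per_expert

    def active_params(self) -> float:
        """Parameters touched when processing one token."""
        return self.shared_params + self.top_k * self.params_per_expert


# Illustrative numbers only.
cfg = MoEConfig(shared_params=0.5e9, params_per_expert=0.1e9, n_experts=64, top_k=8)

if __name__ == "__main__":
    print(f"stored: {cfg.total_params() / 1e9:.1f}B, "
          f"active per token: {cfg.active_params() / 1e9:.1f}B")
```

Whether energy actually follows active parameters on a given phone is an empirical question; the referee report further down asks for exactly this kind of isolation.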

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designers of future edge LLMs should target architectural efficiency rather than further quantization refinements to improve real-world sustainability.
Benchmarking tools for mobile AI should incorporate architecture-specific energy profiles instead of relying mainly on parameter count or quantization level.
Extending the same profiling approach to other hardware platforms could test whether the dominance of architecture over quantization generalizes.

Load-bearing premise

The measurements taken on a single flagship Android device without root access accurately reflect typical user energy consumption and latency without being dominated by thermal throttling or background processes.

What would settle it

Repeating the same model runs on additional devices or with root-level power tracing and observing large energy reductions from importance-aware quantization would falsify the claim that architecture alone determines battery life.

Figures

Figures reproduced from arXiv: 2603.26603 by Eziyo Ehsani, Ivano Malavolta, Luca Giamattei, Roberto Pietrantuono.

**Figure 2.** IQ4_XS: importance-aware 4-bit quantization using calibration data and codebook-based reconstruction.

**Figure 3.** Controlled execution pipeline. Setup is executed once; measurement and …

**Figure 4.** Measurement workflow. A user-space monitoring app logs voltage, …

**Figure 5.** Throughput across models and quantization schemes. Prefill benefits from parallel processing of the prompt, whereas generation is slower due to …

**Figure 6.** Inference latency breakdown. [Plot text recovered from the page: models Qwen2-0.5B, Qwen2.5-1.5B, Phi-2, Qwen2.5-3B, OLMoE-1B-7B-0125, Qwen2.5-7B, Llama3.1-8B, Gemma2-9B; axes Time to First Token (s) and Energy per Token (Joules); series Q4_K_M and IQ4_XS, each for time and energy.]

**Figure 7.** Comparison of Time to First Token (Bars, Left Axis) and Energy per …

**Figure 8.** Multi-objective trade-offs across models and quantization schemes.

**Figure 9.** Total energy per run across models and quantization schemes. Distributions summarize 30 repetitions per configuration.
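
Figure 4 describes a user-space app that logs voltage, and Figure 7 plots energy per token. The sketch below shows one common way such logs can be turned into energy: trapezoidal integration of power over time. It assumes timestamped voltage and current samples are both available. The sample format, units, and function names are assumptions made for illustration, not the authors' pipeline.

```python
from typing import Sequence


def energy_joules(
    t_s: Sequence[float], volts: Sequence[float], amps: Sequence[float]
) -> float:
    """Trapezoidal integral of P(t) = V(t) * I(t) over the sampled interval."""
    if not (len(t_s) == len(volts) == len(amps)):
        raise ValueError("sample sequences must have equal length")
    power = [v * i for v, i in zip(volts, amps)]  # watts at each sample
    total = 0.0
    for k in range(1, len(t_s)):
        dt = t_s[k] - t_s[k - 1]
        if dt < 0:
            raise ValueError("timestamps must be non-decreasing")
        total += 0.5 * (power[k] + power[k - 1]) * dt
    return total


def energy_per_token(total_joules: float, n_tokens: int) -> float:
    """Average energy attributed to each generated token."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return total_joules / n_tokens
```

With a coarse sampling interval, short power spikes are averaged away, which is one reason small deltas between configurations need repeated runs to interpret.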

Original abstract

The migration of Large Language Models (LLMs) from cloud clusters to edge devices promises enhanced privacy and offline accessibility, but this transition encounters a harsh reality: the physical constraints of mobile batteries, thermal limits, and, most importantly, memory constraints. To navigate this landscape, we constructed a replicable and reproducible experimental pipeline to profile the complex interplay between energy consumption, latency, and quality of LLMs on mobile devices. We harness this pipeline to conduct an empirical case study on a flagship Android device, capturing granular metrics across eight LLMs ranging from 0.5B to 9B parameters without requiring root access, ensuring our findings reflect realistic user conditions. The findings highlight the trade-offs between generation quality, performance, power and resource consumption, revealing which LLMs offer the best balance across metrics and under different conditions. Besides, we uncovered a counter-intuitive quantization energy paradox: while modern importance-aware quantization successfully reduces memory footprints to fit larger models into RAM, we found it yields negligible energy savings compared to standard mixed-precision methods. This proves that for battery life, the architecture of the model, not its quantization scheme, is the decisive factor. We further identified that Mixture-of-Experts (MoE) architectures defy the standard size-energy trend, offering the storage capacity of a 7B model while maintaining the lower energy profile of a 1B to 2B model. Finally, an analysis of these multi-objective trade-offs reveals a pragmatic sweet spot of mid-sized models, such as Qwen2.5-3B, that effectively balance response quality with sustainable energy consumption.
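
The multi-objective analysis mentioned in the abstract can be illustrated with a Pareto filter: a configuration is kept if no other configuration is at least as good on every objective and strictly better on one. The sketch below uses invented data points. The names and numbers are hypothetical, not measurements from the paper, and it assumes quality is maximized while energy per token is minimized.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    quality: float           # higher is better
    energy_per_token: float  # joules, lower is better


def dominates(a: Config, b: Config) -> bool:
    """True if a is no worse than b on both objectives and better on one."""
    no_worse = a.quality >= b.quality and a.energy_per_token <= b.energy_per_token
    strictly_better = (
        a.quality > b.quality or a.energy_per_token < b.energy_per_token
    )
    return no_worse and strictly_better


def pareto_front(configs: list[Config]) -> list[Config]:
    """Configurations not dominated by any other, in input order."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Hypothetical data for illustration only.
EXAMPLE = [
    Config("small", 0.55, 0.4),
    Config("mid", 0.75, 1.0),
    Config("large", 0.80, 3.0),
    Config("wasteful", 0.70, 2.5),
]
```

A "sweet spot" in this framing is a point on the front where a further gain in quality costs disproportionately more energy; choosing it still requires a judgment about how the objectives are weighted.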

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript presents a replicable experimental pipeline for profiling energy consumption, latency, generation quality, and resource use of LLMs on a non-rooted flagship Android device. In a case study with eight models (0.5B–9B parameters), it reports multi-objective trade-offs and identifies a quantization energy paradox: importance-aware quantization reduces memory footprints but yields negligible energy savings relative to standard mixed-precision methods, leading to the conclusion that model architecture—not quantization scheme—is the decisive factor for battery life. The work also notes that Mixture-of-Experts architectures maintain low energy profiles despite larger storage requirements and identifies mid-sized models (e.g., Qwen2.5-3B) as pragmatic sweet spots balancing quality and sustainability.

Significance. If the measurements prove robust, the paper supplies valuable real-device data that challenges common assumptions about quantization benefits for energy efficiency in on-device LLMs. The replicable pipeline without root access and the concrete multi-metric findings constitute clear strengths that support reproducibility and practical guidance. The quantization energy paradox and the efficiency observations for MoE models could usefully inform architecture choices for sustainable edge deployment.

major comments (2)

§3 (Experimental Pipeline) and the quantization results: The central paradox claim, that importance-aware quantization produces negligible energy savings versus mixed-precision, rests on energy deltas measured via public non-rooted Android APIs. These readings are susceptible to thermal throttling, background processes, and frequency scaling; without reported per-trial variance, error bars, or statistical tests comparing the small deltas, it is unclear whether the observed differences exceed measurement noise and can support the strong conclusion that architecture alone is decisive.
Results on MoE models: The claim that MoE architectures combine 7B-scale storage with 1B–2B energy profiles requires explicit controls or ablations showing that the energy savings arise from sparse activation rather than other model-specific factors (e.g., layer widths or token throughput). Absent such detail, the deviation from the standard size-energy trend remains suggestive rather than conclusive.

minor comments (3)

Abstract: The sentence 'This proves that...' overstates an empirical observation; rephrase to 'suggests' or 'indicates' to reflect the measurement-based nature of the finding.
Figures and tables: Ensure all energy and latency plots include units, error bars where available, and legends that distinguish quantization variants clearly.
Related work: Add citations to prior mobile LLM energy studies that used comparable Android APIs or external metering for context.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help strengthen the presentation of our empirical findings. We address each major comment below and indicate the corresponding revisions to the manuscript.

Point-by-point responses

Referee: §3 (Experimental Pipeline) and the quantization results: The central paradox claim, that importance-aware quantization produces negligible energy savings versus mixed-precision, rests on energy deltas measured via public non-rooted Android APIs. These readings are susceptible to thermal throttling, background processes, and frequency scaling; without reported per-trial variance, error bars, or statistical tests comparing the small deltas, it is unclear whether the observed differences exceed measurement noise and can support the strong conclusion that architecture alone is decisive.

Authors: We agree that explicit reporting of measurement variability is essential for supporting claims about small energy deltas. Our experimental protocol included repeated trials under controlled conditions to reduce the impact of background processes and thermal effects, but we did not include per-trial variance or formal statistical comparisons in the original submission. In the revised manuscript we will add error bars (standard deviation across runs) to all energy and latency plots and include paired statistical tests to establish whether the observed differences between quantization schemes exceed measurement noise. These additions will provide clearer support for the conclusion that model architecture is the dominant factor. revision: yes

Referee: Results on MoE models: The claim that MoE architectures combine 7B-scale storage with 1B–2B energy profiles requires explicit controls or ablations showing that the energy savings arise from sparse activation rather than other model-specific factors (e.g., layer widths or token throughput). Absent such detail, the deviation from the standard size-energy trend remains suggestive rather than conclusive.

Authors: We appreciate the call for greater isolation of the sparsity effect. Our results derive from head-to-head profiling of multiple models, including MoE variants, on identical hardware and workloads; the lower energy draw of the MoE models is consistent with their known sparse activation pattern. Dedicated ablations that hold all other architectural variables constant are not feasible with the publicly available models we evaluated. In the revision we will expand the discussion section to explicitly list potential confounding factors (layer widths, token throughput) and qualify the MoE observation as a comparative finding rather than a causal claim, while retaining the empirical trend as a useful practical signal for practitioners. revision: partial
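
The paired comparison promised in the first response can be sketched with the Python standard library alone. The function below computes the mean paired difference and a paired t-statistic for energy readings from two quantization schemes. How runs are paired and what the readings look like are assumptions for illustration. A real analysis would also report a p-value, or use a non-parametric test if the differences are not approximately normal.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(
    a: Sequence[float], b: Sequence[float]
) -> tuple[float, float, int]:
    """Return (mean difference, t statistic, degrees of freedom) for paired samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.fmean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_d == 0:
        raise ValueError("zero variance in differences; t statistic undefined")
    t = mean_d / (sd_d / math.sqrt(len(diffs)))
    return mean_d, t, len(diffs) - 1


if __name__ == "__main__":
    # Hypothetical per-run energy readings in joules, paired by run index.
    scheme_a = [10.0, 12.0, 11.0, 13.0]
    scheme_b = [9.0, 12.0, 10.0, 11.0]
    mean_d, t, dof = paired_t_statistic(scheme_a, scheme_b)
    print(f"mean diff = {mean_d:.3f} J, t = {t:.3f}, dof = {dof}")
```

A small mean difference with a t-statistic near zero would be consistent with the "negligible savings" reading; a small difference with a large t-statistic would indicate a real but minor effect.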

Circularity Check

0 steps flagged

No circularity: empirical measurements and observations stand independently of any fitted derivations.

full rationale

The paper describes construction of an experimental pipeline for direct profiling of energy, latency, quality, and memory on a non-rooted Android device across eight LLMs. All central claims, including the quantization energy paradox and the conclusion that model architecture dominates battery life, are presented as outcomes of these replicable measurements rather than predictions derived from equations, parameters fitted to the same data, or self-cited uniqueness theorems. No load-bearing step reduces by construction to its own inputs; the work reports observed trade-offs under stated conditions without renaming known results or smuggling ansatzes via prior citations. This is the expected finding for a measurement-driven empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on empirical measurements rather than mathematical derivations; no free parameters are fitted to produce the paradox, and no new entities are postulated.

axioms (1)

domain assumption: Measurements taken without root access on a single flagship Android device accurately reflect typical user energy and latency under realistic conditions.
Invoked when the authors state that their findings reflect realistic user conditions without root.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We uncovered a counter-intuitive quantization energy paradox: while modern importance-aware quantization successfully reduces memory footprints... the architecture of the model, not its quantization scheme, is the decisive factor."

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Mixture-of-Experts (MoE) architectures defy the standard size-energy trend..."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 11 internal anchors

[1]

Edge computing: Vision and challenges,

W. Shi, J. Cao, Q. Zhang, Y . Li, and L. Xu, “Edge computing: Vision and challenges,”IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646, 2016

work page 2016
[2]

Sustainable AI: Environmental implications, challenges and opportunities,

C.-J. Wu, R. Raghavendra, U. Gupta, B. Acun, N. Ardalani, K. Maeng, G. Chang, F. Aga, J. Huang, C. Baiet al., “Sustainable AI: Environmental implications, challenges and opportunities,”Proceedings of Machine Learning and Systems, vol. 4, pp. 795–813, 2022

work page 2022
[3]

Efficiently scaling transformer inference,

R. Pope, S. Douglas, A. Chowdhery, J. Devlin, J. Bradbury, J. Heek, K. Xiao, S. Agrawal, and J. Dean, “Efficiently scaling transformer inference,”Proceedings of Machine Learning and Systems, vol. 5, 2023

work page 2023
[4]

DeepSpeed-inference: enabling efficient inference of transformer models at unprecedented scale,

R. Y . Aminabadi, S. Rajbhandari, A. A. Awan, C. Li, D. Li, E. Zheng, O. Ruwase, S. Smith, M. Zhang, J. Fanget al., “DeepSpeed-inference: enabling efficient inference of transformer models at unprecedented scale,” inSC22: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2022, pp. 1–15

work page 2022
[5]

[Online]

Monsoon Solutions, Inc.,High V oltage Power Monitor (P/N: AAA10F) User Manual, Monsoon Solutions, Inc., Bellevue, WA, USA, 2024. [Online]. Available: https://www.msoon.com/high-voltage-power-monitor

work page 2024
[6]

Evaluating the effectiveness of model-based power characterization,

J. C. McCullough, Y . Agarwal, J. Chandrashekar, S. Kuppuswamy, A. C. Snoeren, and R. K. Gupta, “Evaluating the effectiveness of model-based power characterization,” inProceedings of the 2011 USENIX Annual Technical Conference (USENIX ATC ’11), 2011

work page 2011
[7]

AppScope: Application energy metering framework for Android smartphones using kernel activity monitoring,

C. Yoon, D. Kim, W. Jung, C. Kang, and H. Cha, “AppScope: Application energy metering framework for Android smartphones using kernel activity monitoring,” inProceedings of the 2012 USENIX Annual Technical Conference (USENIX ATC ’12), 2012

work page 2012
[8]

llama.cpp: LLM inference in C/C++,

G. Gerganov and llama.cpp contributors, “llama.cpp: LLM inference in C/C++,” https://github.com/ggml-org/llama.cpp, 2023

work page 2023
[9]

BERTScore: Evaluating text generation with BERT,

T. Zhang, V . Kishore, F. Wu, K. Q. Weinberger, and Y . Artzi, “BERTScore: Evaluating text generation with BERT,” inProceedings of the 8th International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, Apr 2020. [Online]. Available: https://openreview.net/forum?id=SkeHuCVFDr

work page 2020
[10]

G-Eval: NLG evaluation using GPT-4 with better human alignment,

Y . Liu, D. Iter, Y . Xu, S. Wang, R. Xu, and C. Zhu, “G-Eval: NLG evaluation using GPT-4 with better human alignment,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). Singapore: Association for Computational Linguistics, Dec 2023, pp. 2511–2522. [Online]. Available: https://aclanthology.org/2023.emnlp-main.153

work page 2023
[11]

Green AI,

R. Schwartz, J. Dodge, N. A. Smith, and O. Etzioni, “Green AI,” Communications of the ACM, vol. 63, no. 12, pp. 54–63, 2020

work page 2020
[12]

Energy and policy consid- erations for deep learning in NLP,

E. Strubell, A. Ganesh, and A. McCallum, “Energy and policy consid- erations for deep learning in NLP,” inProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 3645–3650

work page 2019
[13]

The impact of knowledge distillation on the energy consumption and runtime efficiency of nlp models,

Y . Yuan, J. Zhang, Z. Zhang, K. Chen, J. Shi, V . Stoico, and I. Malavolta, “The impact of knowledge distillation on the energy consumption and runtime efficiency of nlp models,” inProceedings of the 2024 IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI (CAIN ’24). Lisbon, Portugal: ACM, 2024

work page 2024
[14]

On-device or remote? on the energy efficiency of fetching llm-generated content,

V . Nguyen, V . Dhopate, H. Huynh, H. Bouhlal, A. Annengala, G. L. Scoccia, M. Martinez, V . Stoico, and I. Malavolta, “On-device or remote? on the energy efficiency of fetching llm-generated content,” inProceedings of the 2025 IEEE/ACM 4th International Conference on AI Engineering - Software Engineering for AI (CAIN ’25). IEEE, 2025, pp. 72–82

work page 2025
[15]

Sometimes painful but certainly promising: Feasibility and trade-offs of language model inference at the edge,

M. Abstreiter, “Sometimes painful but certainly promising: Feasibility and trade-offs of language model inference at the edge,” inProceedings of the 4th Workshop on Machine Learning and Systems (EuroMLSys ’24). Athens, Greece: ACM, 2024, pp. 1–8. [Online]. Available: https://doi.org/10.1145/3642970.3655835

work page doi:10.1145/3642970.3655835 2024
[16]

Smoothquant: Accurate and efficient post-training quantization for large language models,

G. Xiao, J. Lin, F. Seide, S. Hanet al., “Smoothquant: Accurate and efficient post-training quantization for large language models,” in Proceedings of the 40th International Conference on Machine Learning (ICML), 2023

work page 2023
[17]

GPTQ: Accurate post-training quantization for generative pre-trained transformers,

E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “GPTQ: Accurate post-training quantization for generative pre-trained transformers,” in Proceedings of the 11th International Conference on Learning Represen- tations (ICLR), 2023

work page 2023
[18]

LLM.int8(): 8-bit matrix multiplication for transformers at scale,

T. Dettmers, M. Lewis, Y . Belkada, and L. Zettlemoyer, “LLM.int8(): 8-bit matrix multiplication for transformers at scale,” inProceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS ’22), vol. 35, 2022, pp. 30 318–30 332

work page 2022
[19]

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, “AWQ: Activation-aware weight quantization for LLM compression and acceleration,” inProceedings of the 7th MLSys Conference (MLSys 2024), 2024, santa Clara, CA. [Online]. Available: https://arxiv.org/abs/2306.00978

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

TVM: An automated end-to-end optimizing compiler for deep learning,

T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y . Hu, L. Cezeet al., “TVM: An automated end-to-end optimizing compiler for deep learning,” inProceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’18), 2018, pp. 578–594

work page 2018
[21]

MELTing Point: Mobile evaluation of language transformers,

S. Laskaridis, K. Katevas, L. Minto, and H. Haddadi, “MELTing Point: Mobile evaluation of language transformers,” inProceedings of the 30th Annual International Conference on Mobile Computing and Networking (MobiCom ’24). Washington D.C., USA: ACM, Nov 2024, pp. 890–907. [Online]. Available: https://doi.org/10.1145/3636534.3690668

work page doi:10.1145/3636534.3690668 2024
[22]

FlashAttention: Fast and memory-efficient exact attention with IO-awareness,

T. Dao, D. Y . Fu, S. Ermon, A. Rudra, and C. R ´e, “FlashAttention: Fast and memory-efficient exact attention with IO-awareness,” inProceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS ’22), vol. 35, 2022, pp. 16 344–16 359

work page 2022
[23]

An analysis of power consumption in a smartphone,

A. Carroll and G. Heiser, “An analysis of power consumption in a smartphone,” inProceedings of the 2010 USENIX Annual Technical Conference (USENIX ATC ’10), vol. 14, Boston, MA, 2010, pp. 21–21

work page 2010
[24]

Where is the energy spent inside my app? Fine grained energy accounting on smartphones with Eprof,

A. Pathak, Y . C. Hu, and M. Zhang, “Where is the energy spent inside my app? Fine grained energy accounting on smartphones with Eprof,” in Proceedings of the 7th ACM European Conference on Computer Systems (EuroSys ’11), 2011, pp. 29–42

work page 2011
[25]

Batterymanager-companion: Companion app for the bat- terymanager plugin for android-runner,

S2-group, “Batterymanager-companion: Companion app for the bat- terymanager plugin for android-runner,” https://github.com/S2-group/ batterymanager-companion/, 2024

work page 2024
[26]

Green mining: investigating power consumption across versions,

A. Hindle, A. Wilson, K. Rasmussen, E. J. Jedwab, R. Godfrey, and P. Sweeney, “Green mining: investigating power consumption across versions,” inProceedings of the 34th International Conference on Software Engineering (ICSE ’12). IEEE, 2012, pp. 1305–1308

work page 2012
[27]

A framework for the automatic execution of measurement-based experiments on android devices,

I. Malavolta, E. M. Grua, C.-Y . Lam, R. de Vries, F. Tan, E. Zielinski, M. Peters, and L. Kaandorp, “A framework for the automatic execution of measurement-based experiments on android devices,” inProceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (ASE ’20). ACM/IEEE, 2020

work page 2020
[28]

Experiment Runner: A tool for the automatic orchestration of experiments targeting software systems,

M. Karsten, A. C. Dragomir, R. Apsan, V . Stoico, and I. Malavolta, “Experiment Runner: A tool for the automatic orchestration of experiments targeting software systems,”Science of Computer Programming, vol. 239, p. 103415, Jan 2025

work page 2025
[29]

Trepn power profiler,

Qualcomm Technologies, Inc., “Trepn power profiler,” Qualcomm Developer Network, 2024, accessed: 2026-02-12. [Online]. Available: https://developer.qualcomm.com/forums/software/trepn-power-profiler

work page 2024
[30]

Codecarbon: Estimate and track carbon emissions from machine learning computing,

V . Schmidt, K. Goyal, A. Joshi, B. Feld, L. Conell, N. Laskaris, D. Blank, J. Wilson, S. Friedler, and S. Luccioni, “Codecarbon: Estimate and track carbon emissions from machine learning computing,” 2021

work page 2021
[31]

Energibridge: Empowering soft- ware sustainability through cross-platform energy measurement,

J. Sallou, L. Cruz, and T. Durieux, “Energibridge: Empowering soft- ware sustainability through cross-platform energy measurement,”arXiv preprint arXiv:2312.13897, 2023

work page arXiv 2023
[32]

Open LLM Leaderboard,

E. Beeching, C. Fourrier, N. Habib, S. Han, N. Lambert, N. Rajani, O. Sanseviero, L. Tunstall, and T. Wolf, “Open LLM Leaderboard,” Hugging Face Space, 2023. [Online]. Available: https://huggingface.co/ spaces/open-llm-leaderboard/open llm leaderboard

work page 2023
[33]

Training language models to follow instructions with human feedback,

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Training language models to follow instructions with human feedback,” inProceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS ’22), vol. 35, 2022, pp. 27 730–27 744

work page 2022
[34]

Finetuned language models are zero-shot learners,

J. Wei, M. Bosma, V . Y . Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V . Le, “Finetuned language models are zero-shot learners,” inProceedings of the 9th International Conference on Learning Representations (ICLR), 2021

work page 2021
[35]

Qwen2 Technical Report

A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huanget al., “Qwen2 technical report,” arXiv preprint arXiv:2407.10671, 2024. [Online]. Available: https: //arxiv.org/abs/2407.10671

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Qwen2.5 Technical Report

Qwen Team, “Qwen2.5 technical report,”arXiv preprint arXiv:2412.15115, 2024. [Online]. Available: https://arxiv.org/abs/ 2412.15115

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Phi-2: The surprising power of small language models,

M. Javaheripi and S. Bubeck, “Phi-2: The surprising power of small language models,” Microsoft Research Blog, Dec 2023. [Online]. Available: https://www.microsoft.com/en-us/research/blog/ phi-2-the-surprising-power-of-small-language-models/

work page 2023
[39]

OLMoE: Open Mixture-of-Experts Language Models

[Online]. Available: https://arxiv.org/abs/2409.02060

work page internal anchor Pith review Pith/arXiv arXiv
[40]

The Llama 3 Herd of Models

A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letak, A. Mathur, A. Schelten, A. Yang, A. Fanet al., “The Llama 3 herd of models,”arXiv preprint arXiv:2407.21783, 2024. [Online]. Available: https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Gemma 2: Improving Open Language Models at a Practical Size

G. DeepMind, “Gemma 2: Improving open language models at a practical size,”arXiv preprint arXiv:2408.00118, 2024. [Online]. Available: https://arxiv.org/abs/2408.00118

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

AI benchmark: Running deep neural networks on Android smartphones,

A. Ignatov, R. Timofte, W. Chou, K. Wang, M. Wu, T. Hartley, and L. Van Gool, “AI benchmark: Running deep neural networks on Android smartphones,” inProceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018, pp. 0–0

work page 2018
[43]

llama.cpp quantize tool,

G. Gerganov and llama.cpp contributors, “llama.cpp quantize tool,” 2023, accessed: 2026-02-08. [Online]. Available: https://github.com/ ggml-org/llama.cpp/tree/master/examples/quantize

work page 2023
[44]

Dettmers, M

T. Dettmers, M. Lewis, S. Shleifer, and L. Zettlemoyer, “8-bit optimizers via block-wise quantization,” inInternational Conference on Learning Representations, 2022. [Online]. Available: https: //arxiv.org/abs/2110.02861

work page arXiv 2022
[45]

Android Debug Bridge (ADB),

Google, “Android Debug Bridge (ADB),” 2024, accessed: 2026-02-12. [Online]. Available: https://developer.android.com/tools/adb

work page 2024
[46]

Understanding the energy consumption of Android app idle states,

M. A. Hoque, M. Siekkinen, and J. K. Nurminen, “Understanding the energy consumption of Android app idle states,”Pervasive and Mobile Computing, vol. 24, pp. 68–86, 2015

work page 2015
[47]

Power side-channel attacks on mobile devices: A survey

M. Li, Y. Gao, S. F. Al-Sarawi, and D. Abbott, “Power side-channel attacks on mobile devices: A survey,” IEEE Access, vol. 10, pp. 6718–6736, 2022

[48]

Android BatteryManager API reference

Google, “Android BatteryManager API reference,” 2024, accessed: 2026-02-08. [Online]. Available: https://developer.android.com/reference/android/os/BatteryManager

[49]

SummEval: Re-evaluating summarization evaluation

A. R. Fabbri, W. Kryściński, B. McCann, C. Xiong, R. Socher, and D. Radev, “SummEval: Re-evaluating summarization evaluation,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 391–409, 2021

[50]

Efficient memory management for large language model serving with PagedAttention

W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and C. Re, “Efficient memory management for large language model serving with PagedAttention,” in Proceedings of the 29th Symposium on Operating Systems Principles (SOSP ’23), 2023, pp. 611–626

[51]

Hitting the memory wall: implications of the obvious

W. A. Wulf and S. A. McKee, “Hitting the memory wall: implications of the obvious,” ACM SIGARCH Computer Architecture News, vol. 23, no. 1, pp. 20–24, 1995

[52]

Scaling Laws for Neural Language Models

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” arXiv preprint arXiv:2001.08361, 2020

[53]

TensorFlow Lite Micro: Embedded machine learning for TinyML systems

R. David, J. Duke, A. Jain, V. J. Reddi, N. Jeffries, J. Li, N. Krentz, T. Cruesoe, and P. Warden, “TensorFlow Lite Micro: Embedded machine learning for TinyML systems,” Proceedings of Machine Learning and Systems, vol. 3, pp. 800–811, 2021

[54]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” arXiv preprint arXiv:1701.06538, 2017

[55]

Response time in man-computer conversational transactions

R. B. Miller, “Response time in man-computer conversational transactions,” in Proceedings of the fall joint computer conference, part I, 1968, pp. 267–277

[56]

Usability engineering

J. Nielsen, Usability engineering. Morgan Kaufmann, 1993

[57]

Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding

S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” in Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015

[58]

Energy-efficient thermal management for multiprocessor systems-on-chip

J. Kong, S. W. Chung, and K. Choi, “Energy-efficient thermal management for multiprocessor systems-on-chip,” in Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE ’13). IEEE, 2013, pp. 1119–1124

[59]

A survey of quantization methods for efficient neural network inference

A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, “A survey of quantization methods for efficient neural network inference,” arXiv preprint arXiv:2103.13630, 2021

[60]

A White Paper on Neural Network Quantization

M. Nagel, M. Fournarakis, R. A. Amjad, Y. Bondarenko, M. Van Baalen, and T. Blankevoort, “A white paper on neural network quantization,” arXiv preprint arXiv:2106.08295, 2021

[61]

I-BERT: Integer-only BERT quantization

S. Kim, A. Gholami, Z. Yao, M. W. Mahoney, and K. Keutzer, “I-BERT: Integer-only BERT quantization,” in Proceedings of the 38th International Conference on Machine Learning (ICML). PMLR, 2021, pp. 5506–5518

[62]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” Journal of Machine Learning Research, vol. 23, no. 120, pp. 1–39, 2022

[63]

Mixture-of-experts with expert choice routing

Y. Zhou, T. Lei, H. Liu, N. Du, Y. Huang, V. Zhao, A. M. Dai, Q. V. Le, J. Laudon et al., “Mixture-of-experts with expert choice routing,” in Proceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS ’22), vol. 35, 2022, pp. 7103–7114

[64]

MegaBlocks: Efficient sparse training with mixture-of-experts

T. Gale, D. Narayanan, C. Young, and M. Zaharia, “MegaBlocks: Efficient sparse training with mixture-of-experts,” Proceedings of Machine Learning and Systems, vol. 5, pp. 288–304, 2023

[65]

Temperature-aware microarchitecture

K. Skadron, M. R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan, “Temperature-aware microarchitecture,” ACM SIGARCH Computer Architecture News, vol. 32, no. 2, pp. 2–13, 2004

[66]

Power and energy characterization of ARM processors

V. Keller, R. Lachaize, V. Gramoli et al., “Power and energy characterization of ARM processors,” in Proceedings of the 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2014, pp. 116–126

[67]

Dark silicon and the end of multicore scaling

H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” IEEE Micro, vol. 32, no. 3, pp. 122–134, 2011

[68]

Towards sustainable AI: a comprehensive study of carbon footprints in large language models

Y. Wang, Y. Li, X. Zheng, and H. Liu, “Towards sustainable AI: a comprehensive study of carbon footprints in large language models,” arXiv preprint arXiv:2310.03093, 2023

[69]

Understanding and mitigating the security risks of voice-driven interfaces

C. Yan, X. Ji, K. Wang, Q. Jiang, Z. Jin, and W. Xu, “Understanding and mitigating the security risks of voice-driven interfaces,” in Proceedings of the 29th USENIX Security Symposium (USENIX Security 20), 2020, pp. 2625–2642

[70]

Guidelines for conducting and reporting case study research in software engineering

P. Runeson and M. Höst, “Guidelines for conducting and reporting case study research in software engineering,” Empirical Software Engineering, vol. 14, no. 2, pp. 131–164, 2009

[71]

Software wear management for persistent memories

V. Gogte, W. Wang, A. Kolli, and T. F. Wenisch, “Software wear management for persistent memories,” in Proceedings of the 17th USENIX Conference on File and Storage Technologies (FAST ’19), 2019, pp. 45–58

[72]

Fast inference from transformers via speculative decoding

Y. Leviathan, M. Kalman, and Y. Matias, “Fast inference from transformers via speculative decoding,” in Proceedings of the 40th International Conference on Machine Learning (ICML). PMLR, 2023, pp. 19 274–19 286

[73]

Taking AI to the edge: Arm’s new neural processing units

S. Cass, “Taking AI to the edge: Arm’s new neural processing units,” IEEE Spectrum, vol. 56, no. 5, pp. 16–17, 2019

[74]

Lithium-ion battery degradation: what you need to know

S. Pelletier, O. Jabali, G. Laporte, and M. Veneroni, “Lithium-ion battery degradation: what you need to know,” Physical Chemistry Chemical Physics, vol. 19, no. 32, pp. 21 231–21 245, 2017

[75]

MLPerf inference benchmark

V. J. Reddi, C. Cheng, D. Kanter, P. Mattson, G. Schmuelling, C.-J. Wu, B. Anderson, M. Maximov, T. Choudhury, D. Gregg et al., “MLPerf inference benchmark,” ACM SIGARCH Computer Architecture News, vol. 48, no. 1, pp. 50–65, 2020

[76]

Judging LLM-as-a-judge with MT-Bench and chatbot arena

L. Zheng, W.-L. Chiang, Y. Sheng, S. Hao, Z. Wu, J. Ba, Z. L. Jiang, Z. Wu, A. Mirza, Z. Li et al., “Judging LLM-as-a-judge with MT-Bench and chatbot arena,” in Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS ’23), vol. 36, 2023

[77]

Carbon Emissions and Large Neural Network Training

D. Patterson, J. Gonzalez, Q. Le, C. Liang, L.-M. Munguia, D. Rothchild, D. So, M. Texier, and J. Dean, “Carbon emissions and large neural network training,” arXiv preprint arXiv:2104.10350, 2021

[78]

FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

L. Chen, M. Zaharia, and J. Zou, “FrugalGPT: How to use large language models while reducing cost and improving performance,” arXiv preprint arXiv:2305.05176, 2023

[79]

Efficient streaming language models with attention sinks

G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis, “Efficient streaming language models with attention sinks,” in Proceedings of the 12th International Conference on Learning Representations (ICLR), 2023


[1]

Edge computing: Vision and challenges

W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, “Edge computing: Vision and challenges,” IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646, 2016


[2]

Sustainable AI: Environmental implications, challenges and opportunities

C.-J. Wu, R. Raghavendra, U. Gupta, B. Acun, N. Ardalani, K. Maeng, G. Chang, F. Aga, J. Huang, C. Bai et al., “Sustainable AI: Environmental implications, challenges and opportunities,” Proceedings of Machine Learning and Systems, vol. 4, pp. 795–813, 2022


[3]

Efficiently scaling transformer inference

R. Pope, S. Douglas, A. Chowdhery, J. Devlin, J. Bradbury, J. Heek, K. Xiao, S. Agrawal, and J. Dean, “Efficiently scaling transformer inference,” Proceedings of Machine Learning and Systems, vol. 5, 2023


[4]

DeepSpeed-inference: enabling efficient inference of transformer models at unprecedented scale

R. Y. Aminabadi, S. Rajbhandari, A. A. Awan, C. Li, D. Li, E. Zheng, O. Ruwase, S. Smith, M. Zhang, J. Fang et al., “DeepSpeed-inference: enabling efficient inference of transformer models at unprecedented scale,” in SC22: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2022, pp. 1–15


[5]

High Voltage Power Monitor (P/N: AAA10F) User Manual

Monsoon Solutions, Inc., High Voltage Power Monitor (P/N: AAA10F) User Manual, Monsoon Solutions, Inc., Bellevue, WA, USA, 2024. [Online]. Available: https://www.msoon.com/high-voltage-power-monitor


[6]

Evaluating the effectiveness of model-based power characterization

J. C. McCullough, Y. Agarwal, J. Chandrashekar, S. Kuppuswamy, A. C. Snoeren, and R. K. Gupta, “Evaluating the effectiveness of model-based power characterization,” in Proceedings of the 2011 USENIX Annual Technical Conference (USENIX ATC ’11), 2011


[7]

AppScope: Application energy metering framework for Android smartphones using kernel activity monitoring

C. Yoon, D. Kim, W. Jung, C. Kang, and H. Cha, “AppScope: Application energy metering framework for Android smartphones using kernel activity monitoring,” in Proceedings of the 2012 USENIX Annual Technical Conference (USENIX ATC ’12), 2012


[8]

llama.cpp: LLM inference in C/C++

G. Gerganov and llama.cpp contributors, “llama.cpp: LLM inference in C/C++,” https://github.com/ggml-org/llama.cpp, 2023


[9]

BERTScore: Evaluating text generation with BERT

T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating text generation with BERT,” in Proceedings of the 8th International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, Apr 2020. [Online]. Available: https://openreview.net/forum?id=SkeHuCVFDr


[10]

G-Eval: NLG evaluation using GPT-4 with better human alignment

Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu, “G-Eval: NLG evaluation using GPT-4 with better human alignment,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). Singapore: Association for Computational Linguistics, Dec 2023, pp. 2511–2522. [Online]. Available: https://aclanthology.org/2023.emnlp-main.153


[11]

Green AI

R. Schwartz, J. Dodge, N. A. Smith, and O. Etzioni, “Green AI,” Communications of the ACM, vol. 63, no. 12, pp. 54–63, 2020


[12]

Energy and policy considerations for deep learning in NLP

E. Strubell, A. Ganesh, and A. McCallum, “Energy and policy considerations for deep learning in NLP,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 3645–3650


[13]

The impact of knowledge distillation on the energy consumption and runtime efficiency of NLP models

Y. Yuan, J. Zhang, Z. Zhang, K. Chen, J. Shi, V. Stoico, and I. Malavolta, “The impact of knowledge distillation on the energy consumption and runtime efficiency of NLP models,” in Proceedings of the 2024 IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI (CAIN ’24). Lisbon, Portugal: ACM, 2024


[14]

On-device or remote? On the energy efficiency of fetching LLM-generated content

V. Nguyen, V. Dhopate, H. Huynh, H. Bouhlal, A. Annengala, G. L. Scoccia, M. Martinez, V. Stoico, and I. Malavolta, “On-device or remote? On the energy efficiency of fetching LLM-generated content,” in Proceedings of the 2025 IEEE/ACM 4th International Conference on AI Engineering - Software Engineering for AI (CAIN ’25). IEEE, 2025, pp. 72–82


[15]

Sometimes painful but certainly promising: Feasibility and trade-offs of language model inference at the edge

M. Abstreiter, “Sometimes painful but certainly promising: Feasibility and trade-offs of language model inference at the edge,” in Proceedings of the 4th Workshop on Machine Learning and Systems (EuroMLSys ’24). Athens, Greece: ACM, 2024, pp. 1–8. [Online]. Available: https://doi.org/10.1145/3642970.3655835


[16]

SmoothQuant: Accurate and efficient post-training quantization for large language models

G. Xiao, J. Lin, F. Seide, S. Han et al., “SmoothQuant: Accurate and efficient post-training quantization for large language models,” in Proceedings of the 40th International Conference on Machine Learning (ICML), 2023


[17]

GPTQ: Accurate post-training quantization for generative pre-trained transformers

E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “GPTQ: Accurate post-training quantization for generative pre-trained transformers,” in Proceedings of the 11th International Conference on Learning Representations (ICLR), 2023


[18]

LLM.int8(): 8-bit matrix multiplication for transformers at scale

T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, “LLM.int8(): 8-bit matrix multiplication for transformers at scale,” in Proceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS ’22), vol. 35, 2022, pp. 30 318–30 332


[19]

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, “AWQ: Activation-aware weight quantization for LLM compression and acceleration,” in Proceedings of the 7th MLSys Conference (MLSys 2024), 2024, Santa Clara, CA. [Online]. Available: https://arxiv.org/abs/2306.00978


[20]

TVM: An automated end-to-end optimizing compiler for deep learning

T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze et al., “TVM: An automated end-to-end optimizing compiler for deep learning,” in Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’18), 2018, pp. 578–594


[21]

MELTing Point: Mobile evaluation of language transformers

S. Laskaridis, K. Katevas, L. Minto, and H. Haddadi, “MELTing Point: Mobile evaluation of language transformers,” in Proceedings of the 30th Annual International Conference on Mobile Computing and Networking (MobiCom ’24). Washington D.C., USA: ACM, Nov 2024, pp. 890–907. [Online]. Available: https://doi.org/10.1145/3636534.3690668


[22]

FlashAttention: Fast and memory-efficient exact attention with IO-awareness

T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, “FlashAttention: Fast and memory-efficient exact attention with IO-awareness,” in Proceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS ’22), vol. 35, 2022, pp. 16 344–16 359


[23]

An analysis of power consumption in a smartphone

A. Carroll and G. Heiser, “An analysis of power consumption in a smartphone,” in Proceedings of the 2010 USENIX Annual Technical Conference (USENIX ATC ’10), vol. 14, Boston, MA, 2010, pp. 21–21


[24]

Where is the energy spent inside my app? Fine grained energy accounting on smartphones with Eprof

A. Pathak, Y. C. Hu, and M. Zhang, “Where is the energy spent inside my app? Fine grained energy accounting on smartphones with Eprof,” in Proceedings of the 7th ACM European Conference on Computer Systems (EuroSys ’11), 2011, pp. 29–42


[25]

Batterymanager-companion: Companion app for the batterymanager plugin for android-runner

S2-group, “Batterymanager-companion: Companion app for the batterymanager plugin for android-runner,” https://github.com/S2-group/batterymanager-companion/, 2024


[26]

Green mining: investigating power consumption across versions

A. Hindle, A. Wilson, K. Rasmussen, E. J. Jedwab, R. Godfrey, and P. Sweeney, “Green mining: investigating power consumption across versions,” in Proceedings of the 34th International Conference on Software Engineering (ICSE ’12). IEEE, 2012, pp. 1305–1308


[27]

A framework for the automatic execution of measurement-based experiments on Android devices

I. Malavolta, E. M. Grua, C.-Y. Lam, R. de Vries, F. Tan, E. Zielinski, M. Peters, and L. Kaandorp, “A framework for the automatic execution of measurement-based experiments on Android devices,” in Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (ASE ’20). ACM/IEEE, 2020


[28]

Experiment Runner: A tool for the automatic orchestration of experiments targeting software systems

M. Karsten, A. C. Dragomir, R. Apsan, V. Stoico, and I. Malavolta, “Experiment Runner: A tool for the automatic orchestration of experiments targeting software systems,” Science of Computer Programming, vol. 239, p. 103415, Jan 2025


[29]

Trepn power profiler

Qualcomm Technologies, Inc., “Trepn power profiler,” Qualcomm Developer Network, 2024, accessed: 2026-02-12. [Online]. Available: https://developer.qualcomm.com/forums/software/trepn-power-profiler


[30]

Codecarbon: Estimate and track carbon emissions from machine learning computing

V. Schmidt, K. Goyal, A. Joshi, B. Feld, L. Conell, N. Laskaris, D. Blank, J. Wilson, S. Friedler, and S. Luccioni, “Codecarbon: Estimate and track carbon emissions from machine learning computing,” 2021


[31]

Energibridge: Empowering software sustainability through cross-platform energy measurement

J. Sallou, L. Cruz, and T. Durieux, “Energibridge: Empowering software sustainability through cross-platform energy measurement,” arXiv preprint arXiv:2312.13897, 2023


[32]

Open LLM Leaderboard

E. Beeching, C. Fourrier, N. Habib, S. Han, N. Lambert, N. Rajani, O. Sanseviero, L. Tunstall, and T. Wolf, “Open LLM Leaderboard,” Hugging Face Space, 2023. [Online]. Available: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard


[33]

Training language models to follow instructions with human feedback

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” in Proceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS ’22), vol. 35, 2022, pp. 27 730–27 744


[34]

Finetuned language models are zero-shot learners

J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot learners,” in Proceedings of the 9th International Conference on Learning Representations (ICLR), 2021


[35]

Qwen2 Technical Report

A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huang et al., “Qwen2 technical report,” arXiv preprint arXiv:2407.10671, 2024. [Online]. Available: https://arxiv.org/abs/2407.10671


[36]

Qwen2.5 Technical Report

Qwen Team, “Qwen2.5 technical report,” arXiv preprint arXiv:2412.15115, 2024. [Online]. Available: https://arxiv.org/abs/2412.15115


[37]

Phi-2: The surprising power of small language models

M. Javaheripi and S. Bubeck, “Phi-2: The surprising power of small language models,” Microsoft Research Blog, Dec 2023. [Online]. Available: https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/


[39]

OLMoE: Open Mixture-of-Experts Language Models

[Online]. Available: https://arxiv.org/abs/2409.02060

