Sustainability Is Not Linear: Quantifying Performance, Energy, and Privacy Trade-offs in On-Device Intelligence
Pith reviewed 2026-05-21 09:23 UTC · model grok-4.3
The pith
Quantization reduces memory for on-device LLMs but yields negligible energy savings, making architecture the key to battery life.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors constructed a replicable experimental pipeline to profile the interplay between energy consumption, latency, and generation quality of LLMs on a flagship Android device. They report a quantization energy paradox in which importance-aware quantization reduces memory footprints enough to fit larger models into RAM but yields negligible energy savings compared to standard mixed-precision methods. From this they conclude that model architecture, rather than quantization scheme, is the decisive factor for battery life. Mixture-of-Experts architectures offer the storage capacity of a 7B model yet maintain the lower energy profile of a 1B to 2B model. Mid-sized models such as Qwen2.5-3B balance response quality with a
What carries the argument
The quantization energy paradox, which shows that importance-aware quantization fits larger models into RAM but saves little energy compared to mixed-precision methods and therefore makes architecture the controlling factor for power use.
If this is right
- For battery-limited phones, selecting models with efficient architectures such as Mixture-of-Experts permits larger capacity without proportional increases in energy cost.
- Developers can rely on standard mixed-precision quantization rather than more complex importance-aware methods without losing battery performance.
- Mid-sized models provide the clearest practical compromise among quality, energy draw, and resource use under real device constraints.
- On-device deployment for privacy and offline use becomes more feasible once the right model size and architecture are chosen.
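The first implication rests on the difference between the parameters a model must store and the parameters it actually uses per token. The following is a minimal sketch of that arithmetic, not the paper's method. It assumes a rough rule of thumb of about 2 FLOPs per active parameter per generated token; the model labels, parameter counts, and the 4.5 bits-per-weight figure are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str             # hypothetical label, not a model from the paper
    total_params: float   # parameters that must be stored in memory
    active_params: float  # parameters used for each generated token


def weight_gib(spec: ModelSpec, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB; ignores KV cache and runtime overhead."""
    return spec.total_params * bits_per_weight / 8 / 2**30


def flops_per_token(spec: ModelSpec) -> float:
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per active parameter."""
    return 2.0 * spec.active_params


# Placeholder configurations chosen only to illustrate the shape of the argument.
dense_small = ModelSpec("dense-1B", 1e9, 1e9)
dense_large = ModelSpec("dense-7B", 7e9, 7e9)
moe = ModelSpec("moe-7B-total-1B-active", 7e9, 1e9)

for spec in (dense_small, dense_large, moe):
    print(f"{spec.name}: {weight_gib(spec, 4.5):.2f} GiB, "
          f"{flops_per_token(spec):.1e} FLOPs/token")
```

Under these assumptions the MoE placeholder matches the dense 7B model on storage and the dense 1B model on per-token compute. Whether measured energy tracks compute this closely on a real phone is exactly what the paper's profiling is meant to test.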
Where Pith is reading between the lines
- Designers of future edge LLMs should target architectural efficiency rather than further quantization refinements to improve real-world sustainability.
- Benchmarking tools for mobile AI should incorporate architecture-specific energy profiles instead of relying mainly on parameter count or quantization level.
- Extending the same profiling approach to other hardware platforms could test whether the dominance of architecture over quantization generalizes.
Load-bearing premise
The measurements taken on a single flagship Android device without root access accurately reflect typical user energy consumption and latency without being dominated by thermal throttling or background processes.
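To make this premise concrete: measuring energy without root typically means polling battery current and voltage and integrating their product over time. The sketch below is an illustration under stated assumptions, not the paper's pipeline. It assumes timestamped samples with current in microamperes and voltage in millivolts; the `Sample` type and the trace values are hypothetical.

```python
from typing import NamedTuple, Sequence


class Sample(NamedTuple):
    t_s: float         # timestamp in seconds
    current_ua: float  # battery current magnitude in microamperes
    voltage_mv: float  # battery voltage in millivolts


def energy_joules(samples: Sequence[Sample]) -> float:
    """Integrate P = V * I over time with the trapezoidal rule."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    total = 0.0
    for a, b in zip(samples, samples[1:]):
        p_a = (a.current_ua * 1e-6) * (a.voltage_mv * 1e-3)  # watts at sample a
        p_b = (b.current_ua * 1e-6) * (b.voltage_mv * 1e-3)  # watts at sample b
        total += 0.5 * (p_a + p_b) * (b.t_s - a.t_s)
    return total


# Placeholder trace: a constant 1 A draw at 4 V sampled once per second for 2 s.
trace = [
    Sample(0.0, 1_000_000, 4000),
    Sample(1.0, 1_000_000, 4000),
    Sample(2.0, 1_000_000, 4000),
]
print(energy_joules(trace))
```

Every sample in such a trace includes whatever else the device is doing, so thermal throttling or background activity during a run shifts the integral. That is why the premise carries so much weight.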
What would settle it
Repeating the same model runs on additional devices or with root-level power tracing and observing large energy reductions from importance-aware quantization would falsify the claim that architecture alone determines battery life.
Original abstract
The migration of Large Language Models (LLMs) from cloud clusters to edge devices promises enhanced privacy and offline accessibility, but this transition encounters a harsh reality: the physical constraints of mobile batteries, thermal limits, and, most importantly, memory constraints. To navigate this landscape, we constructed a replicable and reproducible experimental pipeline to profile the complex interplay between energy consumption, latency, and quality of LLMs on mobile devices. We harness this pipeline to conduct an empirical case study on a flagship Android device, capturing granular metrics across eight LLMs ranging from 0.5B to 9B parameters without requiring root access, ensuring our findings reflect realistic user conditions. The findings highlight the trade-offs between generation quality, performance, power and resource consumption, revealing which LLMs offer the best balance across metrics and under different conditions. Besides, we uncovered a counter-intuitive quantization energy paradox: while modern importance-aware quantization successfully reduces memory footprints to fit larger models into RAM, we found it yields negligible energy savings compared to standard mixed-precision methods. This proves that for battery life, the architecture of the model, not its quantization scheme, is the decisive factor. We further identified that Mixture-of-Experts (MoE) architectures defy the standard size-energy trend, offering the storage capacity of a 7B model while maintaining the lower energy profile of a 1B to 2B model. Finally, an analysis of these multi-objective trade-offs reveals a pragmatic sweet spot of mid-sized models, such as Qwen2.5-3B, that effectively balance response quality with sustainable energy consumption.
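The "sweet spot" described in the abstract can be read as a Pareto-front selection over competing objectives. The sketch below is one way to express that idea, assuming just two objectives (higher quality is better, lower energy is better). The candidate names and scores are placeholders, not measurements from the paper, and the paper's own analysis may use more metrics.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str        # hypothetical model label
    quality: float   # higher is better (arbitrary units)
    energy_j: float  # lower is better (joules per response)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.quality >= b.quality and a.energy_j <= b.energy_j
    better = a.quality > b.quality or a.energy_j < b.energy_j
    return no_worse and better


def pareto_front(cands: list[Candidate]) -> list[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


# Placeholder scores chosen only to exercise the selection logic.
candidates = [
    Candidate("small", 0.60, 10.0),
    Candidate("mid", 0.80, 20.0),
    Candidate("large", 0.82, 60.0),
    Candidate("wasteful", 0.70, 50.0),  # dominated by "mid"
]
print([c.name for c in pareto_front(candidates)])
```

A mid-sized model counts as a sweet spot in this framing when it sits on the front near the point where extra quality starts to cost disproportionately more energy.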
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a replicable experimental pipeline for profiling energy consumption, latency, generation quality, and resource use of LLMs on a non-rooted flagship Android device. In a case study with eight models (0.5B–9B parameters), it reports multi-objective trade-offs and identifies a quantization energy paradox: importance-aware quantization reduces memory footprints but yields negligible energy savings relative to standard mixed-precision methods, leading to the conclusion that model architecture—not quantization scheme—is the decisive factor for battery life. The work also notes that Mixture-of-Experts architectures maintain low energy profiles despite larger storage requirements and identifies mid-sized models (e.g., Qwen2.5-3B) as pragmatic sweet spots balancing quality and sustainability.
Significance. If the measurements prove robust, the paper supplies valuable real-device data that challenges common assumptions about quantization benefits for energy efficiency in on-device LLMs. The replicable pipeline without root access and the concrete multi-metric findings constitute clear strengths that support reproducibility and practical guidance. The quantization energy paradox and the efficiency observations for MoE models could usefully inform architecture choices for sustainable edge deployment.
major comments (2)
- [§3 (Experimental Pipeline) and quantization results] The central paradox claim, that importance-aware quantization produces negligible energy savings versus mixed-precision, rests on energy deltas measured via public non-rooted Android APIs. These readings are susceptible to thermal throttling, background processes, and frequency scaling; without reported per-trial variance, error bars, or statistical tests comparing the small deltas, it is unclear whether the observed differences exceed measurement noise and can support the strong conclusion that architecture alone is decisive.
- [Results on MoE models] The claim that MoE architectures combine 7B-scale storage with 1B–2B energy profiles requires explicit controls or ablations showing that the energy savings arise from sparse activation rather than other model-specific factors (e.g., layer widths or token throughput). Absent such detail, the deviation from the standard size-energy trend remains suggestive rather than conclusive.
minor comments (3)
- [Abstract] The sentence 'This proves that...' overstates an empirical observation; rephrase to 'suggests' or 'indicates' to reflect the measurement-based nature of the finding.
- [Figures and tables] Ensure all energy and latency plots include units, error bars where available, and legends that distinguish quantization variants clearly.
- [Related work] Add citations to prior mobile LLM energy studies that used comparable Android APIs or external metering for context.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which help strengthen the presentation of our empirical findings. We address each major comment below and indicate the corresponding revisions to the manuscript.
- Referee: [§3 (Experimental Pipeline) and quantization results] The central paradox claim, that importance-aware quantization produces negligible energy savings versus mixed-precision, rests on energy deltas measured via public non-rooted Android APIs. These readings are susceptible to thermal throttling, background processes, and frequency scaling; without reported per-trial variance, error bars, or statistical tests comparing the small deltas, it is unclear whether the observed differences exceed measurement noise and can support the strong conclusion that architecture alone is decisive.
  Authors: We agree that explicit reporting of measurement variability is essential for supporting claims about small energy deltas. Our experimental protocol included repeated trials under controlled conditions to reduce the impact of background processes and thermal effects, but we did not include per-trial variance or formal statistical comparisons in the original submission. In the revised manuscript we will add error bars (standard deviation across runs) to all energy and latency plots and include paired statistical tests to establish that the observed differences between quantization schemes exceed measurement noise. These additions will provide clearer support for the conclusion that model architecture is the dominant factor.
  revision: yes
- Referee: [Results on MoE models] The claim that MoE architectures combine 7B-scale storage with 1B–2B energy profiles requires explicit controls or ablations showing that the energy savings arise from sparse activation rather than other model-specific factors (e.g., layer widths or token throughput). Absent such detail, the deviation from the standard size-energy trend remains suggestive rather than conclusive.
  Authors: We appreciate the call for greater isolation of the sparsity effect. Our results derive from head-to-head profiling of multiple models, including MoE variants, on identical hardware and workloads; the lower energy draw of the MoE models is consistent with their known sparse activation pattern. Dedicated ablations that hold all other architectural variables constant are not feasible with the publicly available models we evaluated. In the revision we will expand the discussion section to explicitly list potential confounding factors (layer widths, token throughput) and qualify the MoE observation as a comparative finding rather than a causal claim, while retaining the empirical trend as a useful practical signal for practitioners.
  revision: partial
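The variance reporting and paired comparison promised in the first response could look like the following. This is a sketch using only the Python standard library. The run values are placeholders rather than results from the paper, and the paired t statistic is one possible choice of test, not necessarily the one the authors will adopt.

```python
import math
import statistics


def mean_sd(values: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation across repeated runs."""
    return statistics.fmean(values), statistics.stdev(values)


def paired_t(a: list[float], b: list[float]) -> tuple[float, int]:
    """Paired t statistic and degrees of freedom for matched runs a[i], b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 runs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = statistics.fmean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder energy readings in joules for the same prompts under two schemes.
scheme_a = [101.0, 99.0, 102.0, 98.0, 100.0]
scheme_b = [100.0, 100.0, 101.0, 99.0, 99.0]

print(mean_sd(scheme_a), mean_sd(scheme_b))
print(paired_t(scheme_a, scheme_b))
```

The statistic is then compared against the critical value for the given degrees of freedom (2.776 for a two-sided 5% test with 4 degrees of freedom). A small statistic means the difference between schemes is not distinguishable from run-to-run noise. That is not the same as showing the difference is zero.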
Circularity Check
No circularity: empirical measurements and observations stand independently of any fitted derivations.
The paper describes construction of an experimental pipeline for direct profiling of energy, latency, quality, and memory on a non-rooted Android device across eight LLMs. All central claims, including the quantization energy paradox and the conclusion that model architecture dominates battery life, are presented as outcomes of these replicable measurements rather than predictions derived from equations, parameters fitted to the same data, or self-cited uniqueness theorems. No load-bearing step reduces by construction to its own inputs; the work reports observed trade-offs under stated conditions without renaming known results or smuggling ansatzes via prior citations. This is the expected finding for a measurement-driven empirical study.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Measurements taken without root access on a single flagship Android device accurately reflect typical user energy and latency under realistic conditions.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We uncovered a counter-intuitive quantization energy paradox: while modern importance-aware quantization successfully reduces memory footprints... the architecture of the model, not its quantization scheme, is the decisive factor."
- IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Mixture-of-Experts (MoE) architectures defy the standard size-energy trend..."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Edge computing: Vision and challenges,
W. Shi, J. Cao, Q. Zhang, Y . Li, and L. Xu, “Edge computing: Vision and challenges,”IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646, 2016
work page 2016
-
[2]
Sustainable AI: Environmental implications, challenges and opportunities,
C.-J. Wu, R. Raghavendra, U. Gupta, B. Acun, N. Ardalani, K. Maeng, G. Chang, F. Aga, J. Huang, C. Baiet al., “Sustainable AI: Environmental implications, challenges and opportunities,”Proceedings of Machine Learning and Systems, vol. 4, pp. 795–813, 2022
work page 2022
-
[3]
Efficiently scaling transformer inference,
R. Pope, S. Douglas, A. Chowdhery, J. Devlin, J. Bradbury, J. Heek, K. Xiao, S. Agrawal, and J. Dean, “Efficiently scaling transformer inference,”Proceedings of Machine Learning and Systems, vol. 5, 2023
work page 2023
-
[4]
DeepSpeed-inference: enabling efficient inference of transformer models at unprecedented scale,
R. Y . Aminabadi, S. Rajbhandari, A. A. Awan, C. Li, D. Li, E. Zheng, O. Ruwase, S. Smith, M. Zhang, J. Fanget al., “DeepSpeed-inference: enabling efficient inference of transformer models at unprecedented scale,” inSC22: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2022, pp. 1–15
work page 2022
- [5]
-
[6]
Evaluating the effectiveness of model-based power characterization,
J. C. McCullough, Y . Agarwal, J. Chandrashekar, S. Kuppuswamy, A. C. Snoeren, and R. K. Gupta, “Evaluating the effectiveness of model-based power characterization,” inProceedings of the 2011 USENIX Annual Technical Conference (USENIX ATC ’11), 2011
work page 2011
-
[7]
C. Yoon, D. Kim, W. Jung, C. Kang, and H. Cha, “AppScope: Application energy metering framework for Android smartphones using kernel activity monitoring,” inProceedings of the 2012 USENIX Annual Technical Conference (USENIX ATC ’12), 2012
work page 2012
-
[8]
llama.cpp: LLM inference in C/C++,
G. Gerganov and llama.cpp contributors, “llama.cpp: LLM inference in C/C++,” https://github.com/ggml-org/llama.cpp, 2023
work page 2023
-
[9]
BERTScore: Evaluating text generation with BERT,
T. Zhang, V . Kishore, F. Wu, K. Q. Weinberger, and Y . Artzi, “BERTScore: Evaluating text generation with BERT,” inProceedings of the 8th International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, Apr 2020. [Online]. Available: https://openreview.net/forum?id=SkeHuCVFDr
work page 2020
-
[10]
G-Eval: NLG evaluation using GPT-4 with better human alignment,
Y . Liu, D. Iter, Y . Xu, S. Wang, R. Xu, and C. Zhu, “G-Eval: NLG evaluation using GPT-4 with better human alignment,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). Singapore: Association for Computational Linguistics, Dec 2023, pp. 2511–2522. [Online]. Available: https://aclanthology.org/2023.emnlp-main.153
work page 2023
- [11]
-
[12]
Energy and policy consid- erations for deep learning in NLP,
E. Strubell, A. Ganesh, and A. McCallum, “Energy and policy consid- erations for deep learning in NLP,” inProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 3645–3650
work page 2019
-
[13]
The impact of knowledge distillation on the energy consumption and runtime efficiency of nlp models,
Y . Yuan, J. Zhang, Z. Zhang, K. Chen, J. Shi, V . Stoico, and I. Malavolta, “The impact of knowledge distillation on the energy consumption and runtime efficiency of nlp models,” inProceedings of the 2024 IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI (CAIN ’24). Lisbon, Portugal: ACM, 2024
work page 2024
-
[14]
On-device or remote? on the energy efficiency of fetching llm-generated content,
V . Nguyen, V . Dhopate, H. Huynh, H. Bouhlal, A. Annengala, G. L. Scoccia, M. Martinez, V . Stoico, and I. Malavolta, “On-device or remote? on the energy efficiency of fetching llm-generated content,” inProceedings of the 2025 IEEE/ACM 4th International Conference on AI Engineering - Software Engineering for AI (CAIN ’25). IEEE, 2025, pp. 72–82
work page 2025
-
[15]
M. Abstreiter, “Sometimes painful but certainly promising: Feasibility and trade-offs of language model inference at the edge,” inProceedings of the 4th Workshop on Machine Learning and Systems (EuroMLSys ’24). Athens, Greece: ACM, 2024, pp. 1–8. [Online]. Available: https://doi.org/10.1145/3642970.3655835
-
[16]
Smoothquant: Accurate and efficient post-training quantization for large language models,
G. Xiao, J. Lin, F. Seide, S. Hanet al., “Smoothquant: Accurate and efficient post-training quantization for large language models,” in Proceedings of the 40th International Conference on Machine Learning (ICML), 2023
work page 2023
-
[17]
GPTQ: Accurate post-training quantization for generative pre-trained transformers,
E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “GPTQ: Accurate post-training quantization for generative pre-trained transformers,” in Proceedings of the 11th International Conference on Learning Represen- tations (ICLR), 2023
work page 2023
-
[18]
LLM.int8(): 8-bit matrix multiplication for transformers at scale,
T. Dettmers, M. Lewis, Y . Belkada, and L. Zettlemoyer, “LLM.int8(): 8-bit matrix multiplication for transformers at scale,” inProceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS ’22), vol. 35, 2022, pp. 30 318–30 332
work page 2022
-
[19]
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, “AWQ: Activation-aware weight quantization for LLM compression and acceleration,” inProceedings of the 7th MLSys Conference (MLSys 2024), 2024, santa Clara, CA. [Online]. Available: https://arxiv.org/abs/2306.00978
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[20]
TVM: An automated end-to-end optimizing compiler for deep learning,
T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y . Hu, L. Cezeet al., “TVM: An automated end-to-end optimizing compiler for deep learning,” inProceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’18), 2018, pp. 578–594
work page 2018
-
[21]
MELTing Point: Mobile evaluation of language transformers,
S. Laskaridis, K. Katevas, L. Minto, and H. Haddadi, “MELTing Point: Mobile evaluation of language transformers,” inProceedings of the 30th Annual International Conference on Mobile Computing and Networking (MobiCom ’24). Washington D.C., USA: ACM, Nov 2024, pp. 890–907. [Online]. Available: https://doi.org/10.1145/3636534.3690668
-
[22]
FlashAttention: Fast and memory-efficient exact attention with IO-awareness,
T. Dao, D. Y . Fu, S. Ermon, A. Rudra, and C. R ´e, “FlashAttention: Fast and memory-efficient exact attention with IO-awareness,” inProceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS ’22), vol. 35, 2022, pp. 16 344–16 359
work page 2022
-
[23]
An analysis of power consumption in a smartphone,
A. Carroll and G. Heiser, “An analysis of power consumption in a smartphone,” inProceedings of the 2010 USENIX Annual Technical Conference (USENIX ATC ’10), vol. 14, Boston, MA, 2010, pp. 21–21
work page 2010
-
[24]
Where is the energy spent inside my app? Fine grained energy accounting on smartphones with Eprof,
A. Pathak, Y . C. Hu, and M. Zhang, “Where is the energy spent inside my app? Fine grained energy accounting on smartphones with Eprof,” in Proceedings of the 7th ACM European Conference on Computer Systems (EuroSys ’11), 2011, pp. 29–42
work page 2011
-
[25]
Batterymanager-companion: Companion app for the bat- terymanager plugin for android-runner,
S2-group, “Batterymanager-companion: Companion app for the bat- terymanager plugin for android-runner,” https://github.com/S2-group/ batterymanager-companion/, 2024
work page 2024
-
[26]
Green mining: investigating power consumption across versions,
A. Hindle, A. Wilson, K. Rasmussen, E. J. Jedwab, R. Godfrey, and P. Sweeney, “Green mining: investigating power consumption across versions,” inProceedings of the 34th International Conference on Software Engineering (ICSE ’12). IEEE, 2012, pp. 1305–1308
work page 2012
-
[27]
A framework for the automatic execution of measurement-based experiments on android devices,
I. Malavolta, E. M. Grua, C.-Y . Lam, R. de Vries, F. Tan, E. Zielinski, M. Peters, and L. Kaandorp, “A framework for the automatic execution of measurement-based experiments on android devices,” inProceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (ASE ’20). ACM/IEEE, 2020
work page 2020
-
[28]
Experiment Runner: A tool for the automatic orchestration of experiments targeting software systems,
M. Karsten, A. C. Dragomir, R. Apsan, V . Stoico, and I. Malavolta, “Experiment Runner: A tool for the automatic orchestration of experiments targeting software systems,”Science of Computer Programming, vol. 239, p. 103415, Jan 2025
work page 2025
-
[29]
Qualcomm Technologies, Inc., “Trepn power profiler,” Qualcomm Developer Network, 2024, accessed: 2026-02-12. [Online]. Available: https://developer.qualcomm.com/forums/software/trepn-power-profiler
work page 2024
-
[30]
Codecarbon: Estimate and track carbon emissions from machine learning computing,
V . Schmidt, K. Goyal, A. Joshi, B. Feld, L. Conell, N. Laskaris, D. Blank, J. Wilson, S. Friedler, and S. Luccioni, “Codecarbon: Estimate and track carbon emissions from machine learning computing,” 2021
work page 2021
-
[31]
Energibridge: Empowering soft- ware sustainability through cross-platform energy measurement,
J. Sallou, L. Cruz, and T. Durieux, “Energibridge: Empowering soft- ware sustainability through cross-platform energy measurement,”arXiv preprint arXiv:2312.13897, 2023
-
[32]
E. Beeching, C. Fourrier, N. Habib, S. Han, N. Lambert, N. Rajani, O. Sanseviero, L. Tunstall, and T. Wolf, “Open LLM Leaderboard,” Hugging Face Space, 2023. [Online]. Available: https://huggingface.co/ spaces/open-llm-leaderboard/open llm leaderboard
work page 2023
-
[33]
Training language models to follow instructions with human feedback,
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Training language models to follow instructions with human feedback,” inProceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS ’22), vol. 35, 2022, pp. 27 730–27 744
work page 2022
-
[34]
Finetuned language models are zero-shot learners,
J. Wei, M. Bosma, V . Y . Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V . Le, “Finetuned language models are zero-shot learners,” inProceedings of the 9th International Conference on Learning Representations (ICLR), 2021
work page 2021
-
[35]
A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huanget al., “Qwen2 technical report,” arXiv preprint arXiv:2407.10671, 2024. [Online]. Available: https: //arxiv.org/abs/2407.10671
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[36]
Qwen Team, “Qwen2.5 technical report,”arXiv preprint arXiv:2412.15115, 2024. [Online]. Available: https://arxiv.org/abs/ 2412.15115
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[37]
Phi-2: The surprising power of small language models,
M. Javaheripi and S. Bubeck, “Phi-2: The surprising power of small language models,” Microsoft Research Blog, Dec 2023. [Online]. Available: https://www.microsoft.com/en-us/research/blog/ phi-2-the-surprising-power-of-small-language-models/
work page 2023
-
[39]
OLMoE: Open Mixture-of-Experts Language Models
[Online]. Available: https://arxiv.org/abs/2409.02060
work page internal anchor Pith review Pith/arXiv arXiv
-
[40]
A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letak, A. Mathur, A. Schelten, A. Yang, A. Fanet al., “The Llama 3 herd of models,”arXiv preprint arXiv:2407.21783, 2024. [Online]. Available: https://arxiv.org/abs/2407.21783
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[41]
Gemma 2: Improving Open Language Models at a Practical Size
G. DeepMind, “Gemma 2: Improving open language models at a practical size,”arXiv preprint arXiv:2408.00118, 2024. [Online]. Available: https://arxiv.org/abs/2408.00118
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[42]
AI benchmark: Running deep neural networks on Android smartphones,
A. Ignatov, R. Timofte, W. Chou, K. Wang, M. Wu, T. Hartley, and L. Van Gool, “AI benchmark: Running deep neural networks on Android smartphones,” inProceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018, pp. 0–0
work page 2018
-
[43]
G. Gerganov and llama.cpp contributors, “llama.cpp quantize tool,” 2023, accessed: 2026-02-08. [Online]. Available: https://github.com/ ggml-org/llama.cpp/tree/master/examples/quantize
work page 2023
-
[44]
T. Dettmers, M. Lewis, S. Shleifer, and L. Zettlemoyer, “8-bit optimizers via block-wise quantization,” inInternational Conference on Learning Representations, 2022. [Online]. Available: https: //arxiv.org/abs/2110.02861
-
[45]
Google, “Android Debug Bridge (ADB),” 2024, accessed: 2026-02-12. [Online]. Available: https://developer.android.com/tools/adb
work page 2024
-
[46]
Understanding the energy consumption of Android app idle states,
M. A. Hoque, M. Siekkinen, and J. K. Nurminen, “Understanding the energy consumption of Android app idle states,”Pervasive and Mobile Computing, vol. 24, pp. 68–86, 2015
work page 2015
-
[47]
Power side-channel attacks on mobile devices: A survey,
M. Li, Y . Gao, S. F. Al-Sarawi, and D. Abbott, “Power side-channel attacks on mobile devices: A survey,”IEEE Access, vol. 10, pp. 6718– 6736, 2022
work page 2022
-
[48]
Android BatteryManager API reference,
Google, “Android BatteryManager API reference,” 2024, accessed: 2026-02-08. [Online]. Available: https://developer.android.com/reference/ android/os/BatteryManager
work page 2024
-
[49]
SummEval: Re-evaluating summarization evaluation,
A. R. Fabbri, W. Kry ´sci´nski, B. McCann, C. Xiong, R. Socher, and D. Radev, “SummEval: Re-evaluating summarization evaluation,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 391–409, 2021
work page 2021
-
[50]
Efficient memory management for large language model serving with PagedAttention,
W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and C. Re, “Efficient memory management for large language model serving with PagedAttention,” inProceedings of the 29th Symposium on Operating Systems Principles (SOSP ’23), 2023, pp. 611–626
work page 2023
-
[51]
Hitting the memory wall: implications of the obvious,
W. A. Wulf and S. A. McKee, “Hitting the memory wall: implications of the obvious,”ACM SIGARCH Computer Architecture News, vol. 23, no. 1, pp. 20–24, 1995
work page 1995
-
[52]
Scaling Laws for Neural Language Models
J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,”arXiv preprint arXiv:2001.08361, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[53]
TensorFlow Lite Micro: Embedded machine learning for TinyML systems,
R. David, J. Duke, A. Jain, V . J. Reddi, N. Jeffries, J. Li, N. Krentz, T. Cruesoe, and P. Warden, “TensorFlow Lite Micro: Embedded machine learning for TinyML systems,”Proceedings of Machine Learning and Systems, vol. 3, pp. 800–811, 2021
work page 2021
-
[54]
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,”arXiv preprint arXiv:1701.06538, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[55]
Response time in man-computer conversational transac- tions,
R. B. Miller, “Response time in man-computer conversational transac- tions,” inProceedings of the fall joint computer conference, part I, 1968, pp. 267–277
work page 1968
-
[56]
J. Nielsen,Usability engineering. Morgan Kaufmann, 1993
work page 1993
-
[57]
S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” inProceedings of the 3rd International Conference on Learning Representations (ICLR), 2015
work page 2015
-
[58]
Energy-efficient thermal manage- ment for multiprocessor systems-on-chip,
J. Kong, S. W. Chung, and K. Choi, “Energy-efficient thermal manage- ment for multiprocessor systems-on-chip,” inProceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE ’13). IEEE, 2013, pp. 1119–1124
work page 2013
-
[59]
A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, “A survey of quantization methods for efficient neural network inference,” arXiv preprint arXiv:2103.13630, 2021
-
[60]
A White Paper on Neural Network Quantization
M. Nagel, M. Fournarakis, R. A. Amjad, Y . Bondarenko, M. Van Baalen, and T. Blankevoort, “A white paper on neural network quantization,” arXiv preprint arXiv:2106.08295, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[61]
I- BERT: Integer-only BERT quantization,
S. Kim, A. Gholami, Z. Yao, M. W. Mahoney, and K. Keutzer, “I- BERT: Integer-only BERT quantization,” inProceedings of the 38th International Conference on Machine Learning (ICML). PMLR, 2021, pp. 5506–5518
work page 2021
-
[62]
Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,
W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,”Journal of Machine Learning Research, vol. 23, no. 120, pp. 1–39, 2022
work page 2022
-
[63]
Mixture-of-experts with expert choice routing,
Y . Zhou, T. Lei, H. Liu, N. Du, Y . Huang, V . Zhao, A. M. Dai, Q. V . Le, J. Laudonet al., “Mixture-of-experts with expert choice routing,” in Proceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS ’22), vol. 35, 2022, pp. 7103–7114
work page 2022
-
[64]
MegaBlocks: Efficient sparse training with mixture-of-experts,
T. Gale, D. Narayanan, C. Young, and M. Zaharia, “MegaBlocks: Efficient sparse training with mixture-of-experts,”Proceedings of Machine Learning and Systems, vol. 5, pp. 288–304, 2023
work page 2023
-
[65]
Temperature-aware microarchitecture,
K. Skadron, M. R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan, “Temperature-aware microarchitecture,”ACM SIGARCH Computer Architecture News, vol. 32, no. 2, pp. 2–13, 2004
work page 2004
-
[66]
Power and energy characterization of ARM processors,
V. Keller, R. Lachaize, V. Gramoli et al., “Power and energy characterization of ARM processors,” in Proceedings of the 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2014, pp. 116–126
-
[67]
Dark silicon and the end of multicore scaling,
H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” IEEE Micro, vol. 32, no. 3, pp. 122–134, 2011
-
[68]
Towards sustainable AI: a comprehensive study of carbon footprints in large language models,
Y. Wang, Y. Li, X. Zheng, and H. Liu, “Towards sustainable AI: a comprehensive study of carbon footprints in large language models,” arXiv preprint arXiv:2310.03093, 2023
-
[69]
Understanding and mitigating the security risks of voice-driven interfaces,
C. Yan, X. Ji, K. Wang, Q. Jiang, Z. Jin, and W. Xu, “Understanding and mitigating the security risks of voice-driven interfaces,” in Proceedings of the 29th USENIX Security Symposium (USENIX Security 20), 2020, pp. 2625–2642
-
[70]
Guidelines for conducting and reporting case study research in software engineering,
P. Runeson and M. Höst, “Guidelines for conducting and reporting case study research in software engineering,” Empirical Software Engineering, vol. 14, no. 2, pp. 131–164, 2009
-
[71]
Software wear management for persistent memories,
V. Gogte, W. Wang, A. Kolli, and T. F. Wenisch, “Software wear management for persistent memories,” in Proceedings of the 17th USENIX Conference on File and Storage Technologies (FAST ’19), 2019, pp. 45–58
-
[72]
Fast inference from transformers via speculative decoding,
Y. Leviathan, M. Kalman, and Y. Matias, “Fast inference from transformers via speculative decoding,” in Proceedings of the 40th International Conference on Machine Learning (ICML). PMLR, 2023, pp. 19 274–19 286
-
[73]
Taking AI to the edge: Arm’s new neural processing units,
S. Cass, “Taking AI to the edge: Arm’s new neural processing units,” IEEE Spectrum, vol. 56, no. 5, pp. 16–17, 2019
-
[74]
Lithium-ion battery degradation: what you need to know,
S. Pelletier, O. Jabali, G. Laporte, and M. Veneroni, “Lithium-ion battery degradation: what you need to know,” Physical Chemistry Chemical Physics, vol. 19, no. 32, pp. 21 231–21 245, 2017
-
[75]
V. J. Reddi, C. Cheng, D. Kanter, P. Mattson, G. Schmuelling, C.-J. Wu, B. Anderson, M. Maximov, T. Choudhury, D. Gregg et al., “MLPerf inference benchmark,” ACM SIGARCH Computer Architecture News, vol. 48, no. 1, pp. 50–65, 2020
-
[76]
Judging LLM-as-a-judge with MT-Bench and chatbot arena,
L. Zheng, W.-L. Chiang, Y. Sheng, S. Hao, Z. Wu, J. Ba, Z. L. Jiang, Z. Wu, A. Mirza, Z. Li et al., “Judging LLM-as-a-judge with MT-Bench and chatbot arena,” in Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS ’23), vol. 36, 2023
-
[77]
Carbon Emissions and Large Neural Network Training
D. Patterson, J. Gonzalez, Q. Le, C. Liang, L.-M. Munguia, D. Rothchild, D. So, M. Texier, and J. Dean, “Carbon emissions and large neural network training,” arXiv preprint arXiv:2104.10350, 2021
-
[78]
FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance
L. Chen, M. Zaharia, and J. Zou, “FrugalGPT: How to use large language models while reducing cost and improving performance,” arXiv preprint arXiv:2305.05176, 2023
-
[79]
Efficient streaming language models with attention sinks,
G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis, “Efficient streaming language models with attention sinks,” in Proceedings of the 12th International Conference on Learning Representations (ICLR), 2023