pith. machine review for the scientific record.

arxiv: 2604.07857 · v1 · submitted 2026-04-09 · 📡 eess.SY · cs.AI · cs.SY

Recognition: 2 Lean theorem links

Networking-Aware Energy Efficiency in Agentic AI Inference: A Survey

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:33 UTC · model grok-4.3

classification 📡 eess.SY · cs.AI · cs.SY
keywords Agentic AI · energy efficiency · networking-aware inference · Perception-Reasoning-Action cycle · cross-layer co-design · edge computing · LLM inference optimization · autonomous systems

The pith

Agentic AI inference incurs compounding computational and communication energy costs; an energy accounting framework and a unified taxonomy can organize those costs for cross-layer optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey examines how Agentic AI, which runs perception, reasoning, and action in continuous loops for autonomous adaptation, creates new energy demands beyond those of standard large language models. It shows that iterative inference plus persistent data exchange over networks produces both computational and communication costs that compound in mobile and edge settings. The authors propose an energy accounting framework to identify these costs across the full cycle and build a taxonomy covering model simplification, computation control, input and attention optimization, and hardware-aware inference. They further explore cross-layer co-design that jointly tunes model parameters, wireless transmissions, and edge resources. The work supplies a roadmap toward scalable, energy-efficient autonomous systems in networked environments.

Core claim

In this survey, we propose an energy accounting framework identifying computational and communication costs across the Perception-Reasoning-Action cycle. We establish a unified taxonomy spanning model simplification, computation control, input and attention optimization, and hardware-aware inference. We explore cross-layer co-design strategies jointly optimizing model parameters, wireless transmissions, and edge resources.

What carries the argument

Energy accounting framework that tracks computational and communication costs through the Perception-Reasoning-Action cycle of Agentic AI, organized by a unified taxonomy of optimization techniques.
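
As a quick gloss on that machinery, the four-way taxonomy can be written down as a small lookup structure. The category-to-technique mapping below is an editorial sketch assembled from works in the reference graph at the bottom of this page, not a table taken from the paper:

```python
# Editorial sketch only: the survey's four taxonomy categories, with
# representative techniques drawn from works cited in the reference graph.
# The paper's own category-to-technique assignments may differ.
TAXONOMY = {
    "model_simplification": [
        "pruning (SparseGPT, SlimGPT)",
        "quantization (QLoRA, KIVI, LLM-MQ)",
        "knowledge distillation (MiniLLM)",
    ],
    "computation_control": [
        "early exit / skip decoding (SkipDecode)",
        "input-conditioned layer dropping",
        "token-adaptive compute allocation (D-LLM)",
    ],
    "input_and_attention_optimization": [
        "prompt compression (LLMLingua)",
        "KV-cache compression and scheduling (CacheGen, HotPrefix)",
        "IO-aware attention (FlashAttention)",
    ],
    "hardware_aware_inference": [
        "GPU DVFS and throttling (ThrottLL'eM)",
        "memory-constrained serving (LLM in a flash, PagedAttention)",
        "mixed-precision kernels (MARLIN)",
    ],
}

def categorize(technique: str) -> str | None:
    """Return the taxonomy category whose example list mentions `technique`."""
    for category, examples in TAXONOMY.items():
        if any(technique.lower() in e.lower() for e in examples):
            return category
    return None
```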

If this is right

  • Cross-layer co-design enables joint optimization of model parameters, wireless transmissions, and edge resources to lower total energy use in mobile edge computing and autonomous systems (a toy sketch follows after this list).
  • The taxonomy organizes existing techniques so that model simplification and hardware-aware inference can be applied together with input optimization for iterative loops.
  • Identification of open challenges in federated green learning and carbon-aware agency points toward future directions for self-sustaining Agentic AI.
  • The framework supports development of 6G-native Agentic AI by linking inference energy accounting directly to networking constraints.
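
To make the first bullet concrete, here is a toy sketch of cross-layer co-design as a joint search over one knob per layer: quantization bit-width (model), transmit power (wireless), and effective compute throughput (edge resource). Every constant and cost model below is a hypothetical placeholder, not a value or method from the survey:

```python
# Toy cross-layer co-design: jointly pick bit-width, transmit power, and
# compute throughput to minimize per-cycle energy under a latency budget.
import math
from itertools import product

FLOPS_PER_TOKEN = 2e9        # assumed FLOPs per generated token
TOKENS = 128                 # assumed tokens per Perception-Reasoning-Action cycle
OBS_BITS = 4e6               # assumed uplink bits per observation
BANDWIDTH_HZ, NOISE_W = 1e7, 1e-10
LATENCY_BUDGET_S = 1.0

def cycle_cost(bits: int, tx_power_w: float, flops_per_s: float):
    """Return (energy_J, latency_s) for one cycle under this configuration."""
    # Model knob: lower bit-width is assumed to cut effective FLOP cost linearly.
    t_comp = FLOPS_PER_TOKEN * TOKENS * (bits / 16) / flops_per_s
    p_comp = 1e-34 * flops_per_s ** 3          # DVFS-style cubic power model
    # Wireless knob: Shannon-capacity uplink for the observation payload.
    rate_bps = BANDWIDTH_HZ * math.log2(1 + tx_power_w / NOISE_W)
    t_comm = OBS_BITS / rate_bps
    energy = p_comp * t_comp + tx_power_w * t_comm
    return energy, t_comp + t_comm

configs = product([4, 8, 16], [0.1, 0.5, 2.0], [1e11, 5e11, 1e12])
best = min((c for c in configs if cycle_cost(*c)[1] <= LATENCY_BUDGET_S),
           key=lambda c: cycle_cost(*c)[0])
print("bit-width, tx power (W), throughput (FLOP/s):", best)
```

The surveyed methods replace this brute-force search with optimization and learning over real hardware and channel models; the point here is only the shape of the joint problem.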

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same accounting approach could be tested on non-LLM agentic pipelines to check whether communication costs remain dominant when perception involves sensor fusion rather than text.
  • Extending the framework to include carbon-intensity signals from the grid could turn the taxonomy into a tool for carbon-aware scheduling of inference tasks (see the scheduling sketch after this list).
  • Applying the cross-layer ideas to multi-agent swarms might reveal new coordination overheads not captured in single-agent surveys.
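
To make the carbon-aware extension above concrete, a minimal scheduling sketch might look like the following. The forecast values, task list, and greedy policy are invented for illustration; nothing here comes from the paper:

```python
# Hypothetical carbon-aware scheduling: given a grid carbon-intensity forecast
# (gCO2 per kWh) and a per-task energy estimate from an accounting framework,
# run each deferrable inference task in the cleanest slot before its deadline.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    energy_kwh: float   # estimated energy per run
    deadline_slot: int  # latest hourly slot in which the task may run

forecast = [450, 410, 380, 210, 190, 230, 320, 400]  # gCO2/kWh per hour (made up)

def schedule(tasks: list[Task]) -> dict[str, int]:
    """Greedily place each task in the lowest-carbon slot before its deadline."""
    placement = {}
    for task in sorted(tasks, key=lambda t: t.deadline_slot):
        placement[task.name] = min(range(task.deadline_slot + 1),
                                   key=lambda s: forecast[s])
    return placement

tasks = [Task("nightly-replanning", 0.8, 7), Task("map-refresh", 0.3, 4)]
print(schedule(tasks))  # e.g. {'map-refresh': 4, 'nightly-replanning': 4}
```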

Load-bearing premise

Computational and communication energy costs dominate in Agentic AI, and a single unified taxonomy plus cross-layer co-design can organize and solve the problem without missing major trade-offs or requiring entirely new categories.

What would settle it

A concrete demonstration that a substantial energy cost or trade-off in Agentic AI inference falls outside the four taxonomy categories or cannot be addressed by the surveyed cross-layer co-design approaches.

Figures

Figures reproduced from arXiv: 2604.07857 by Dusit Niyato, Haiqi Yu, Ruichen Zhang, Shugong Xu, Shunqing Zhang, Wei Ni, Xiaojing Chen, Xin Wang.

Figure 1. Survey organization: Section 2 introduces Agentic AI concepts and energy accounting; Section 3 …
Figure 2. Generic framework illustrating how Edge Agentic AI autonomously integrates multimodal observations, …
Figure 3. An overview of energy-efficient optimization methods.
Figure 4. An overview of integrated wireless-edge intelligence for sustainable Agentic AI.
Original abstract

The rapid emergence of Large Language Models (LLMs) has catalyzed Agentic artificial intelligence (AI), autonomous systems integrating perception, reasoning, and action into closed-loop pipelines for continuous adaptation. While unlocking transformative applications in mobile edge computing, autonomous systems, and next-generation wireless networks, this paradigm creates fundamental energy challenges through iterative inference and persistent data exchange. Unlike traditional AI where bottlenecks are computational Floating Point Operations (FLOPs), Agentic AI faces compounding computational and communication energy costs. In this survey, we propose an energy accounting framework identifying computational and communication costs across the Perception-Reasoning-Action cycle. We establish a unified taxonomy spanning model simplification, computation control, input and attention optimization, and hardware-aware inference. We explore cross-layer co-design strategies jointly optimizing model parameters, wireless transmissions, and edge resources. Finally, we identify open challenges of federated green learning, carbon-aware agency, 6th generation mobile communication (6G)-native Agentic AI, and self-sustaining systems, providing a roadmap for scalable autonomous intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. This survey proposes an energy accounting framework that identifies both computational and communication energy costs across the Perception-Reasoning-Action cycle in Agentic AI. It introduces a unified taxonomy organized into four areas—model simplification, computation control, input and attention optimization, and hardware-aware inference—while examining cross-layer co-design strategies that jointly optimize model parameters, wireless transmissions, and edge resources. The paper concludes by outlining open challenges in federated green learning, carbon-aware agency, 6G-native Agentic AI, and self-sustaining systems.

Significance. If the literature synthesis is comprehensive, the framework and taxonomy provide a timely organizing structure for the emerging intersection of Agentic AI and networked systems, where communication energy costs compound iterative inference. The cross-layer perspective and explicit roadmap of open challenges could help guide research in mobile edge computing and wireless AI, areas where pure compute-centric optimizations are insufficient.

minor comments (2)
  1. [Taxonomy section] The taxonomy is introduced as spanning four categories, but without an accompanying summary table or diagram that maps representative techniques and cited works to each category, readers may find it difficult to quickly assess coverage and relationships between areas.
  2. [Energy accounting framework] The energy accounting framework description would benefit from a concrete example or pseudocode illustrating how computational FLOPs and communication bits are combined into a total energy metric for a sample Perception-Reasoning-Action loop; a hedged sketch of what such an example could look like follows below.
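
For minor comment 2, a minimal sketch of such an example might look like the following. Every coefficient is a hypothetical placeholder rather than a figure from the paper, and the paper's actual accounting may decompose costs differently:

```python
# Minimal sketch: combine compute (FLOPs) and communication (bits) into one
# energy total for a single Perception-Reasoning-Action loop.
J_PER_FLOP = 1e-11       # assumed effective energy per FLOP on the edge device
J_PER_BIT_TX = 1e-7      # assumed uplink radio energy per transmitted bit
J_PER_BIT_RX = 2e-8      # assumed downlink radio energy per received bit

def loop_energy(perception_flops, reasoning_flops, action_flops,
                uplink_bits, downlink_bits):
    e_comp = (perception_flops + reasoning_flops + action_flops) * J_PER_FLOP
    e_comm = uplink_bits * J_PER_BIT_TX + downlink_bits * J_PER_BIT_RX
    return e_comp, e_comm, e_comp + e_comm

# One loop: encode sensor frames, run an on-device LLM planner, emit control
# commands, and exchange state with an edge server (all magnitudes invented).
e_comp, e_comm, total = loop_energy(
    perception_flops=5e10, reasoning_flops=2.56e11, action_flops=1e9,
    uplink_bits=4e6, downlink_bits=5e5,
)
print(f"compute {e_comp:.2f} J + communication {e_comm:.2f} J = {total:.2f} J")
```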

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript's significance and for recommending minor revision. The summary accurately captures the core contributions of our energy accounting framework, taxonomy, and cross-layer co-design perspective for Agentic AI inference.

Circularity Check

0 steps flagged

No significant circularity in survey synthesis

Full rationale

This is a survey paper proposing an energy accounting framework and unified taxonomy for Agentic AI by synthesizing cited literature across model simplification, computation control, input/attention optimization, and hardware-aware inference, plus cross-layer co-design. No mathematical derivations, equations, fitted parameters, predictions, or uniqueness theorems are asserted that could reduce by construction to the paper's own inputs or self-citations. The central claims are a descriptive organization of external work, leaving the argument self-contained with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey, the paper's central contributions rest on the authors' synthesis of prior literature on AI energy optimization and wireless networking. No free parameters, axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5506 in / 1256 out tokens · 56136 ms · 2026-05-10T17:33:00.840596+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

148 extracted references · 73 canonical work pages · 1 internal anchor

  1. [1]

    Rishabh Agrawal, Himanshu Kumar, and Shashikant Reddy Lnu. 2025. Efficient LLMs for edge devices: Pruning, quantization, and distillation techniques. In 2025 International Conference on Machine Learning and Autonomous Systems (ICMLAS). IEEE, 1413–1418

  2. [2]

    Jin-Hyun Ahn, Osvaldo Simeone, and Joonhyuk Kang. 2020. Wireless Federated Distillation for Distributed Edge Learning with Heterogeneous Data. IEEE Transactions on Wireless Communications 19, 11 (2020), 7130–7144

  3. [3]

    Mo Ahtasam. 2025. DOL-LLM: Optimizing Large Language Model Inference with Domain-Specific Adaptations and Efficiency Techniques via Quantization, Pruning, and Distillation. Authorea Preprints (2025)

  4. [4]

    Keivan Alizadeh, Iman Mirzadeh, Hooman Shahrokhi, Dmitry Belenko, Chenfan Sun, Minsik Cho, Mohammad Sekhavat, Moin Nabi, and Mehrdad Farajtabar. 2024. Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models

  5. [5]

    Keivan Alizadeh, Seyed Iman Mirzadeh, Dmitry Belenko, S Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar. 2024. LLM in a flash: Efficient large language model inference with limited memory. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 12562–12584

  6. [6]

    Yongqi An, Xu Zhao, Tao Yu, Ming Tang, and Jinqiao Wang. 2024. Fluctuation-Based Adaptive Structured Pruning for Large Language Models. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)

  7. [7]

    Sudharshana B, Nandhini V, and AkilaGandhi G S ME. 2025. A Comprehensive Review of LLM Neural Network Enhancements for Advanced Driving Assistance Systems Through Quantization. In 2025 International Conference on Advancements in Smart, Secure and Intelligent Computing (ASSIC). 1–7. doi:10.1109/ASSIC64892.2025.11158258

  8. [8]

    Hankyul Baek, Gyu Seon Kim, Soohyun Park, Andreas F. Molisch, and Joongheon Kim. 2025. Slimmable Federated Reinforcement Learning for Energy-Efficient Proactive Caching. IEEE Transactions on Networking 33, 4 (2025), 2079–2094. doi:10.1109/TON.2025.3554608

  9. [9]

    Tong Bai, Bohan Huang, Zichuan Xu, Bo Hou, Haoran Zhao, and Zhipeng Wang. 2025. Adaptive Feature Compression and Resource Scheduling for End-Edge Co-Inference. IEEE Internet of Things Journal 12, 18 (2025), 37255–37270. doi:10.1109/JIOT.2025.3582220

  10. [10]

    Krishna Bajpai and Vedanshi Gupta. 2025. EcoLLM: A Joint Optimization Framework for Ultra-Low Power, Mixed-Precision LLM Inference on Resource-Constrained Edge Systems. Authorea Preprints (2025)

  11. [11]

    Pedram Bakhtiariifard, Christian Igel, and Raghavendra Selvan. 2024. EC-NAS: Energy Consumption Aware Tabular Benchmarks for Neural Architecture Search. In ICASSP 2024–2024 IEEE International Conference on Acoustics, Speech and Signal Processing. 5660–5664

  12. [12]

    Rui Bao, Nan Xue, Yaping Sun, and Zhiyong Chen. 2025. Dynamic Quality-Latency Aware Routing for LLM Inference in Wireless Edge-Device Networks. In 2025 IEEE/CIC International Conference on Communications in China (ICCC Workshops). IEEE. doi:10.1109/ICCCWorkshops67136.2025.11147210

  13. [13]

    S Bhardwaj, P Singh, and M K Pandit. 2024. A Survey on the Integration and Optimization of Large Language Models in Edge Computing Environments. In 2024 16th International Conference on Computer and Automation Engineering (ICCAE). 168–172

  14. [14]

    Parag Biswas, Abdur Rashid, Angona Biswas, Md Abdullah Al Nasim, Sovon Chakraborty, Kishor Datta Gupta, and Roy George. 2024. AI-driven approaches for optimizing power consumption: A comprehensive survey. Discover Artificial Intelligence 4, 116 (2024)

  15. [15]

    Marcello Bullo, Seifallah Jardak, Pietro Carnelli, and Deniz Gündüz. 2024. Energy-Aware Dynamic Neural Inference. arXiv preprint arXiv:2411.02471 (2024)

  16. [16]

    Mohak Chadha, Thandayuthapani Subramanian, Eishi Arima, Michael Gerndt, Martin Schulz, and Osama Abboud. Greencourier: Carbon-aware scheduling for serverless functions. In Proceedings of the 9th International Workshop on Serverless Computing. 18–23

  18. [18]

    Xiaojing Chen, Si Chen, Wei Ni, Xin Wang, Sihai Zhang, Shunqing Zhang, Yanzan Sun, Shugong Xu, and Abbas Jamalipour. 2024. Optimal Two-Timescale Configuration of Mobile Edge Computing With Mixed Energy Supply. IEEE Transactions on Smart Grid 15, 5 (2024), 4765–4778. doi:10.1109/TSG.2024.3390772

  19. [19]

    Xiaojing Chen, Zhuoxiao Chen, Wei Ni, Zhenxu Bai, and Shunqing Zhang. 2024. Joint User Association and Resource Allocation for Smart-Grid-Powered Wireless Networks Under Constrained Carbon Emission. IEEE Wireless Communications Letters 13, 11 (2024), 3217–3221. doi:10.1109/LWC.2024.3459010

  20. [20]

    Xiaojing Chen, Yijun Ding, Wei Ni, Xin Wang, Yichuang Sun, and Shunqing Zhang. 2025. Towards Dynamic Energy/Carbon Trading and Resource Allocation for MEC: A Two-Timescale Deep Reinforcement Learning Approach. In 2025 IEEE/CIC International Conference on Communications in China (ICCC). 1–6. doi:10.1109/ICCC65529.2025.11148917

  21. [21]

    Xiaojing Chen, Zhenyuan Li, Wei Ni, Xin Wang, Shunqing Zhang, Yanzan Sun, Shugong Xu, and Qingqi Pei. 2024. Toward Dynamic Resource Allocation and Client Scheduling in Hierarchical Federated Learning: A Two-Phase Deep Reinforcement Learning Approach. IEEE Trans. Commun. 72, 12 (2024), 7798–7813. doi:10.1109/TCOMM.2024.3420733

  22. [22]

    Xiaojing Chen, Hanfei Wen, Wei Ni, Shunqing Zhang, Xin Wang, Shugong Xu, and Qingqi Pei. 2022. Distributed Online Optimization of Edge Computing With Mixed Power Supply of Renewable Energy and Smart Grid. IEEE Transactions on Communications 70, 1 (2022), 389–403. doi:10.1109/TCOMM.2021.3123275

  23. [23]

    Yuxuan Chen, Rongpeng Li, Xiaoxue Yu, Zhifeng Zhao, and Honggang Zhang. 2025. Adaptive layer splitting for wireless large language model inference in edge computing: A model-based reinforcement learning approach. Frontiers of Information Technology & Electronic Engineering 26, 2 (2025), 278–292. doi:10.1631/FITEE.2400468

  24. [24]

    Han Cho, Apurba Prasad Padhy, Fernando Camacho, and Saibal Mukhopadhyay. 2025. Sub 4-bit Power-of-Two Based Mixed-Precision Quantization for Efficient LLM Compression and Acceleration. IEEE Access (2025), 1–1. doi:10.1109/ACCESS.2025.3625771

  25. [25]

    Xuan-Toan Dang, Binh-Minh Vu, Quynh-Suong Nguyen, Thi-Thuy-Minh Tran, Joon-Soo Eom, and Oh-Soon Shin. 2024. A Survey on Energy-Efficient Design for Federated Learning over Wireless Networks. Energies 17, 24 (2024). doi:10.3390/en17246485

  27. [27]

    Pierre V. Dantas, Lucas C. Cordeiro, and Waldir S. S. Junior. 2025. A review of state-of-the-art techniques for large language model compression. Complex & Intelligent Systems 11, 407 (2025)

  28. [28]

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS)

  29. [29]

    Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah, and Subhabrata Mukherjee. 2023. SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference. arXiv preprint arXiv:2307.02628 (2023)

  30. [30]

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient Finetuning of Quantized Large Language Models. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS)

  31. [31]

    Hongyang Du, Zehui Li, Dusit Niyato, Jiawen Kang, Zehui Xiong, Xuemin Shen, and Dong In Kim. 2024. Enabling AI-Generated Content Services in Wireless Edge Networks. IEEE Wireless Communications 31, 3 (2024), 226–234

  32. [32]

    Kiannah Foster, Andrew Johansson, Elizabeth Williams, Daniel Petrovic, and Nicholas Kovalenko. 2024. A Token-Agnostic Approach to Controlling Generated Text Length in Large Language Models. Research Square (2024). doi:10.21203/rs.3.rs-5204102/v1. Preprint

  33. [33]

    Elias Frantar and Dan Alistarh. 2023. SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot. In Proceedings of the 40th International Conference on Machine Learning (ICML). PMLR

  34. [34]

    Elias Frantar, Roberto L. Castro, Jiale Chen, Torsten Hoefler, and Dan Alistarh. 2025. MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models. In Proceedings of the 30th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’25). doi:10.1145/3710848.3710871

  35. [35]

    Shangqian Gao, Chi-Heng Lin, Ting Hua, Tang Zheng, Yilin Shen, Hongxia Jin, and Yen-Chang Hsu. 2024. DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models. In Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS)

  36. [36]

    Jay Gorvadiya, Ankur Chagela, and Mohendra Roy. 2025. Energy efficient pruning and quantization methods for deep learning models. In 2025 International Conference on Sustainable Energy Technologies and Computational Intelligence (SETCOM). IEEE, 1–6

  37. [37]

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. 2024. MiniLLM: Knowledge Distillation of Large Language Models. arXiv preprint arXiv:2306.08543 (2024)

  38. [38]

    Shuaishuai Guo, Yanhu Wang, Shujing Li, and Nasir Saeed. 2023. Semantic Importance-Aware Communications Using Pre-Trained Language Models. IEEE Communications Letters 27, 9 (2023), 2328–2332. doi:10.1109/LCOMM.2023.3293805

  39. [39]

    Sama Habibi and Ozgur Ercetin. 2025. Edge-LLM Inference With Cost-Aware Layer Allocation and Adaptive Scheduling. IEEE Access 13 (2025), 131614–131637. doi:10.1109/ACCESS.2025.3592308

  40. [40]

    Siem Hadish, Maher Guizani, Moayad Aloqaily, and Latif U. Khan. 2025. Transformer Based Architecture for Smart Grid Energy Consumption Forecasting. In 2025 International Wireless Communications and Mobile Computing (IWCMC). 1726–1731. doi:10.1109/IWCMC65282.2025.11059615

  41. [41]

    Abdul Hannan, Daniele Falavigna, and Alessio Brutti. 2025. Input Conditioned Layer Dropping in Speech Foundation Models. In 2025 IEEE 35th International Workshop on Machine Learning for Signal Processing (MLSP). 1–6. doi:10.1109/MLSP62443.2025.11204255

  42. [42]

    Zixu Hao, Huiqiang Jiang, Shiqi Jiang, Ju Ren, and Ting Cao. 2024. Hybrid SLM and LLM for Edge-Cloud Collaborative Inference. In Proceedings of the Workshop on Edge and Mobile Foundation Models and Workshop on Mobile Computing with Large Language Models (EdgeFM ’24). ACM, 1–6. doi:10.1145/3662006.3662067

  43. [43]

    Zeqi Hao, Guoqing Xu, Yun Luo, Heng Hu, Jianping An, and Shiwen Mao. 2023. Multi-Agent Collaborative Inference via DNN Decoupling: Intermediate Feature Compression and Edge Learning. IEEE Transactions on Mobile Computing 22, 10 (2023), 6041–6055

  44. [44]

    Ying He, Jingcheng Fang, F. Richard Yu, and Victor C. Leung. 2024. Large Language Models (LLMs) Inference Offloading and Resource Allocation in Cloud-Edge Computing: An Active Inference Approach. IEEE Transactions on Mobile Computing 23, 12 (2024), 11253–11266. doi:10.1109/TMC.2024.3415661

  45. [45]

    John L. Hennessy and David A. Patterson. 2017. Computer Architecture: A Quantitative Approach (6 ed.). Morgan Kaufmann, Cambridge, MA, USA

  46. [46]

    Miao Hu, Qi He, and Di Wu. 2025. QLLMS: Quantization-Adaptive LLM Scheduling for Partially Informed Edge Serving Systems. In Proceedings of IEEE INFOCOM. doi:10.1109/INFOCOM55648.2025.11044591

  47. [47]

    Shuyan Hu, Xiaojing Chen, Wei Ni, Xin Wang, and Ekram Hossain. 2020. Modeling and Analysis of Energy Harvesting and Smart Grid-Powered Wireless Communication Networks: A Contemporary Survey. IEEE Transactions on Green Communications and Networking 4, 2 (2020), 461–496. doi:10.1109/TGCN.2020.2988270

  48. [48]

    E J Husom, A Goknil, M Astekin, L K Shar, A Kåsen, S Sen, B A Mithassel, and A Soylu. 2025. Sustainable LLM Inference for Edge AI: Evaluating Quantized LLMs for Energy Efficiency, Output Accuracy, and Inference Latency. ACM Transactions on Internet of Things (2025)

  49. [49]

    Kunal Jain, Anjaly Parayil, Ankur Mallick, Esha Choukse, Xiaoting Qin, Jue Zhang, Íñigo Goiri, Rujia Wang, Chetan Bansal, Victor Rühle, Anoop Kulkarni, Steve Kofsky, and Saravan Rajmohan. 2025. Performance Aware LLM Load Balancer for Mixed Workloads. In Proceedings of the 5th Workshop on Machine Learning and Systems (EuroMLSys ’25). ACM, 19–30. doi:10.1145...

  50. [50]

    Metod Jazbec, Patrick Forré, Stephan Mandt, Dan Zhang, and Eric Nalisnick. 2024. Early-Exit Neural Networks with Nested Prediction Sets. arXiv preprint arXiv:2311.05931 (2024). Accepted at the 40th Conference on Uncertainty in Artificial Intelligence (UAI 2024)

  51. [51]

    Chaoyi Jiang, Lei Gao, Hossein Entezari Zarch, and Murali Annavaram. 2025. KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation. In Findings of the Association for Computational Linguistics: ACL 2025. Association for Computational Linguistics, 19474–19488. https://github.com/chaoyij/KVPR

  52. [52]

    Huiqiang Jiang, Qianhui Wu, Chin-Yew Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2023. LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). 13358–13376

  53. [53]

    Yikun Jiang, Huanyu Wang, Lei Xie, Hanbin Zhao, Chao Zhang, Hui Qian, and John C.S. Lui. 2024. D-LLM: A Token Adaptive Computing Resource Allocation Strategy for Large Language Models. In Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS)

  54. [54]

    Hongpeng Jin and Yanzhao Wu. 2025. CE-CoLLM: Efficient and Adaptive Large Language Models Through Cloud-Edge Collaboration. In 2025 IEEE International Conference on Web Services (ICWS). 316–323. doi:10.1109/ICWS67624.2025.00046

  55. [55]

    Andreas Kosmas Kakolyris, Dimosthenis Masouros, Petros Vavaroutsos, Sotirios Xydis, and Dimitrios Soudris. 2025. ThrottLL’eM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 1363–1378

  56. [56]

    Andreas Kosmas Kakolyris, Dimosthenis Masouros, Sotirios Xydis, and Dimitrios Soudris. 2024. SLO-Aware GPU DVFS for Energy-Efficient LLM Inference Serving. IEEE Computer Architecture Letters 23, 2 (2024), 150–153. doi:10.1109/LCA.2024.3406038

  57. [57]

    Christopher Keith, Michael Robinson, Francis Duncan, Allan Worthington, Joseph Wilson, and Sofia Harris. 2024. Optimizing Large Language Models: A Novel Approach Through Dynamic Token Pruning. Research Square (2024). doi:10.21203/rs.3.rs-5293588/v1

  58. [58]

    Rui Kong, Yuanchun Li, Qingtian Feng, Weijun Wang, Xiaozhou Ye, Ye Ouyang, Linghe Kong, and Yunxin Liu. 2024. SwapMoE: Serving off-the-shelf MoE-based large language models with tunable memory budget. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 6710–6720

  59. [59]

    Sravani Kurma, Anal Paul, Keshav Singh, Kapal Dev, and Chih-Peng Li. 2025. LLMs for Resource Allocation in Next-Gen RIS-Aided Healthcare Wireless Networks. In IEEE INFOCOM 2025 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). 1–6. doi:10.1109/INFOCOMWKSHPS65812.2025.11152831

  60. [60]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP ’23). ACM, 611–626. doi:10.1145/3600006.3613165

  61. [61]

    Lei Lei, Yaxiong Yuan, Yu Zhou, Yang Yang, Yu Luo, Lina Pu, and Symeon Chatzinotas. 2024. Energy Optimization and Lightweight Design for Efficient Federated Learning in Wireless Edge Systems. IEEE Transactions on Vehicular Technology 73, 9 (2024), 13542–13557

  62. [62]

    Hanxi Li, Guorong Chen, Bin Wang, Zheng Chen, Yongsheng Zhu, Fuqiang Hu, Jiao Dai, and Wei Wang. 2025. PFedKD: Personalized Federated Learning via Knowledge Distillation Using Unlabeled Pseudo Data for Internet of Things. IEEE Internet of Things Journal 12, 11 (June 2025), 16314–16327. doi:10.1109/JIOT.2025.3533003

  63. [63]

    Jinrong Li, Biao Han, Sudan Li, Xiaoyan Wang, and Jie Li. 2024. CoLLM: A Collaborative LLM Inference Framework for Resource-Constrained Devices. In Proceedings of the IEEE/CIC International Conference on Communications in China (ICCC). doi:10.1109/ICCC62479.2024.10681712

  64. [64]

    Shiyao Li, Xuefei Ning, Ke Hong, Tengxuan Liu, Luning Wang, Xiuhong Li, Kai Zhong, Guohao Dai, Huazhong Yang, and Yu Wang. 2023. LLM-MQ: Mixed-precision Quantization for Efficient LLM Deployment. In NeurIPS 2023 Workshop on Efficient Natural Language and Speech Processing

  65. [65]

    Yuhang Li, Rong Gu, Chengying Huan, Zhibin Wang, Renjie Yao, Chen Tian, and Guihai Chen. 2025. HotPrefix: Hotness-Aware KV Cache Scheduling for Efficient Prefix Sharing in LLM Inference Systems. Proceedings of the ACM on Management of Data (SIGMOD) 3, 4 (2025), Article 250. doi:10.1145/3749168

  66. [66]

    Zonghang Li, Wenjiao Feng, Mohsen Guizani, and Hongfang Yu. 2025. TPI-LLM: Serving 70B-Scale LLMs Efficiently on Low-Resource Mobile Devices. IEEE Transactions on Services Computing 18, 5 (2025), 3321–3333. doi:10.1109/TSC.2025.3596892

  67. [67]

    Chengsi Liang, Hongyang Du, Yao Sun, Dusit Niyato, Jiawen Kang, Dezong Zhao, and Muhammad Ali Imran. 2025. Generative AI-Driven Semantic Communication Networks: Architecture, Technologies, and Applications. IEEE Transactions on Cognitive Communications and Networking 11, 1 (2025), 27–47. doi:10.1109/TCCN.2024.3435524

  68. [68]

    Gui Ling, Ziyang Wang, Yuliang Yan, and Qingwen Liu. 2024. SlimGPT: Layer-wise Structured Pruning for Large Language Models. In Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS)

  69. [69]

    Dong Liu and Yanxuan Yu. 2025. TinyServe: Query-Aware Cache Selection for Efficient LLM Serving. In Proceedings of the 33rd ACM International Conference on Multimedia (MM ’25). ACM, 12528–12536. doi:10.1145/3746027.3758181

  70. [70]

    Jiacheng Liu, Peng Tang, Wenfeng Wang, Yuhang Ren, Xiaofeng Hou, Pheng-Ann Heng, Minyi Guo, and Chao Li. 2024. A Survey on Inference Optimization Techniques for Mixture of Experts Models. arXiv preprint arXiv:2412.14219 (2024)

  72. [72]

    Shu Liu, Dingzhu Wen, Da Li, Qimei Chen, Guangxu Zhu, and Yuanming Shi. 2024. Energy-Efficient Optimal Mode Selection for Edge AI Inference via Integrated Sensing-Communication-Computation. IEEE Transactions on Mobile Computing 23, 12 (2024), 14248–14262. doi:10.1109/TMC.2024.3440581

  73. [73]

    Sicong Liu, Weiye Wu, Xiangrui Xu, Teng Li, Bowen Pang, Bin Guo, and Zhiwen Yu. 2025. Adaptive and Resource-efficient Agentic AI Systems for Mobile and Embedded Devices: A Survey. arXiv:2510.00078 [cs.LG] https://arxiv.org/abs/2510.00078

  74. [74]

    Yuxuan Liu. 2024. Learning to Reason with Autoregressive In-Context Distillation. In Proceedings of the International Conference on Learning Representations (ICLR), Tiny Papers Track

  75. [75]

    Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. 2024. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving. In Proceedings of the ACM SIGCOMM 2024 Conference. ACM, 1–18. doi:10....

  76. [76]

    Zhang Liu, Hongyang Du, Lianfen Huang, Zhibin Gao, and Dusit Niyato. 2025. Joint Model Caching and Resource Allocation in Generative AI-Enabled Wireless Edge Networks. In 2025 IEEE Wireless Communications and Networking Conference (WCNC). 1–6. doi:10.1109/WCNC61545.2025.10978225

  77. [77]

    Zhihao Liu, Xianliang Yang, Zichuan Liu, Yifan Xia, Wei Jiang, Yuanyu Zhang, Lijuan Li, Guoliang Fan, Lei Song, and Bian Jiang. 2024. Knowing what not to do: Leverage language model insights for action space pruning in multi-agent reinforcement learning. arXiv preprint arXiv:2405.16854 (2024)

  78. [78]

    Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2024. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. In International Conference on Machine Learning (ICML)

  79. [79]

    Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, and Vikas Chandra. 2024. MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases. In International Conference on Machine Learning (ICML)

  80. [80]

    Jianlan Luo, Perry Dong, Jeffrey Wu, Aviral Kumar, Xinyang Geng, and Sergey Levine. 2023. Action-Quantized Offline Reinforcement Learning for Robotic Skill Learning. In Proceedings of the 7th Conference on Robot Learning (CoRL). Atlanta, USA. https://saqrl.github.io

Showing first 80 references.