pith. machine review for the scientific record.

arxiv: 2604.07857 · v1 · submitted 2026-04-09 · 📡 eess.SY · cs.AI · cs.SY

Recognition: 2 Lean theorem links

Networking-Aware Energy Efficiency in Agentic AI Inference: A Survey

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:33 UTC · model grok-4.3

classification 📡 eess.SY · cs.AI · cs.SY
keywords Agentic AI · energy efficiency · networking-aware inference · Perception-Reasoning-Action cycle · cross-layer co-design · edge computing · LLM inference optimization · autonomous systems

The pith

Agentic AI inference incurs compounding computational and communication energy costs; an energy accounting framework and a unified taxonomy can organize those costs for cross-layer optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey examines how Agentic AI, which runs perception, reasoning, and action in continuous loops for autonomous adaptation, creates new energy demands beyond those of standard large language models. It shows that iterative inference plus persistent data exchange over networks produces both computational and communication costs that compound in mobile and edge settings. The authors propose an energy accounting framework to identify these costs across the full cycle and build a taxonomy covering model simplification, computation control, input and attention optimization, and hardware-aware inference. They further explore cross-layer co-design that jointly tunes model parameters, wireless transmissions, and edge resources. The work supplies a roadmap toward scalable, energy-efficient autonomous systems in networked environments.

Core claim

In this survey, we propose an energy accounting framework identifying computational and communication costs across the Perception-Reasoning-Action cycle. We establish a unified taxonomy spanning model simplification, computation control, input and attention optimization, and hardware-aware inference. We explore cross-layer co-design strategies jointly optimizing model parameters, wireless transmissions, and edge resources.

What carries the argument

Energy accounting framework that tracks computational and communication costs through the Perception-Reasoning-Action cycle of Agentic AI, organized by a unified taxonomy of optimization techniques.
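
As a quick gloss on that machinery, the four-way taxonomy can be written down as a small lookup structure. The category-to-technique mapping below is an editorial sketch assembled from works in the reference graph at the bottom of this page, not a table taken from the paper:

```python
# Editorial sketch only: the survey's four taxonomy categories, with
# representative techniques drawn from works cited in the reference graph.
# The paper's own category-to-technique assignments may differ.
TAXONOMY = {
    "model_simplification": [
        "pruning (SparseGPT, SlimGPT)",
        "quantization (QLoRA, KIVI, LLM-MQ)",
        "knowledge distillation (MiniLLM)",
    ],
    "computation_control": [
        "early exit / skip decoding (SkipDecode)",
        "input-conditioned layer dropping",
        "token-adaptive compute allocation (D-LLM)",
    ],
    "input_and_attention_optimization": [
        "prompt compression (LLMLingua)",
        "KV-cache compression and scheduling (CacheGen, HotPrefix)",
        "IO-aware attention (FlashAttention)",
    ],
    "hardware_aware_inference": [
        "GPU DVFS and throttling (ThrottLL'eM)",
        "memory-constrained serving (LLM in a flash, PagedAttention)",
        "mixed-precision kernels (MARLIN)",
    ],
}

def categorize(technique: str) -> str | None:
    """Return the taxonomy category whose example list mentions `technique`."""
    for category, examples in TAXONOMY.items():
        if any(technique.lower() in e.lower() for e in examples):
            return category
    return None
```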

If this is right

  • Cross-layer co-design enables joint optimization of model parameters, wireless transmissions, and edge resources to lower total energy use in mobile edge computing and autonomous systems (a toy sketch follows after this list).
  • The taxonomy organizes existing techniques so that model simplification and hardware-aware inference can be applied together with input optimization for iterative loops.
  • Identification of open challenges in federated green learning and carbon-aware agency points toward future directions for self-sustaining Agentic AI.
  • The framework supports development of 6G-native Agentic AI by linking inference energy accounting directly to networking constraints.
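
To make the first bullet concrete, here is a toy sketch of cross-layer co-design as a joint search over one knob per layer: quantization bit-width (model), transmit power (wireless), and effective compute throughput (edge resource). Every constant and cost model below is a hypothetical placeholder, not a value or method from the survey:

```python
# Toy cross-layer co-design: jointly pick bit-width, transmit power, and
# compute throughput to minimize per-cycle energy under a latency budget.
import math
from itertools import product

FLOPS_PER_TOKEN = 2e9        # assumed FLOPs per generated token
TOKENS = 128                 # assumed tokens per Perception-Reasoning-Action cycle
OBS_BITS = 4e6               # assumed uplink bits per observation
BANDWIDTH_HZ, NOISE_W = 1e7, 1e-10
LATENCY_BUDGET_S = 1.0

def cycle_cost(bits: int, tx_power_w: float, flops_per_s: float):
    """Return (energy_J, latency_s) for one cycle under this configuration."""
    # Model knob: lower bit-width is assumed to cut effective FLOP cost linearly.
    t_comp = FLOPS_PER_TOKEN * TOKENS * (bits / 16) / flops_per_s
    p_comp = 1e-34 * flops_per_s ** 3          # DVFS-style cubic power model
    # Wireless knob: Shannon-capacity uplink for the observation payload.
    rate_bps = BANDWIDTH_HZ * math.log2(1 + tx_power_w / NOISE_W)
    t_comm = OBS_BITS / rate_bps
    energy = p_comp * t_comp + tx_power_w * t_comm
    return energy, t_comp + t_comm

configs = product([4, 8, 16], [0.1, 0.5, 2.0], [1e11, 5e11, 1e12])
best = min((c for c in configs if cycle_cost(*c)[1] <= LATENCY_BUDGET_S),
           key=lambda c: cycle_cost(*c)[0])
print("bit-width, tx power (W), throughput (FLOP/s):", best)
```

The surveyed methods replace this brute-force search with optimization and learning over real hardware and channel models; the point here is only the shape of the joint problem.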

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same accounting approach could be tested on non-LLM agentic pipelines to check whether communication costs remain dominant when perception involves sensor fusion rather than text.
  • Extending the framework to include carbon-intensity signals from the grid could turn the taxonomy into a tool for carbon-aware scheduling of inference tasks (see the scheduling sketch after this list).
  • Applying the cross-layer ideas to multi-agent swarms might reveal new coordination overheads not captured in single-agent surveys.
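
To make the carbon-aware extension above concrete, a minimal scheduling sketch might look like the following. The forecast values, task list, and greedy policy are invented for illustration; nothing here comes from the paper:

```python
# Hypothetical carbon-aware scheduling: given a grid carbon-intensity forecast
# (gCO2 per kWh) and a per-task energy estimate from an accounting framework,
# run each deferrable inference task in the cleanest slot before its deadline.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    energy_kwh: float   # estimated energy per run
    deadline_slot: int  # latest hourly slot in which the task may run

forecast = [450, 410, 380, 210, 190, 230, 320, 400]  # gCO2/kWh per hour (made up)

def schedule(tasks: list[Task]) -> dict[str, int]:
    """Greedily place each task in the lowest-carbon slot before its deadline."""
    placement = {}
    for task in sorted(tasks, key=lambda t: t.deadline_slot):
        placement[task.name] = min(range(task.deadline_slot + 1),
                                   key=lambda s: forecast[s])
    return placement

tasks = [Task("nightly-replanning", 0.8, 7), Task("map-refresh", 0.3, 4)]
print(schedule(tasks))  # e.g. {'map-refresh': 4, 'nightly-replanning': 4}
```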

Load-bearing premise

Computational and communication energy costs dominate in Agentic AI, and a single unified taxonomy plus cross-layer co-design can organize and solve the problem without missing major trade-offs or requiring entirely new categories.

What would settle it

A concrete demonstration that a substantial energy cost or trade-off in Agentic AI inference falls outside the four taxonomy categories or cannot be addressed by the surveyed cross-layer co-design approaches.

Figures

Figures reproduced from arXiv: 2604.07857 by Dusit Niyato, Haiqi Yu, Ruichen Zhang, Shugong Xu, Shunqing Zhang, Wei Ni, Xiaojing Chen, Xin Wang.

Figure 1. Survey organization: Section 2 introduces Agentic AI concepts and energy accounting; Section 3 …
Figure 2. Generic framework illustrating how Edge Agentic AI autonomously integrates multimodal observations, …
Figure 3. An overview of energy-efficient optimization methods.
Figure 4. An overview of integrated wireless-edge intelligence for sustainable Agentic AI.
Original abstract

The rapid emergence of Large Language Models (LLMs) has catalyzed Agentic artificial intelligence (AI), autonomous systems integrating perception, reasoning, and action into closed-loop pipelines for continuous adaptation. While unlocking transformative applications in mobile edge computing, autonomous systems, and next-generation wireless networks, this paradigm creates fundamental energy challenges through iterative inference and persistent data exchange. Unlike traditional AI where bottlenecks are computational Floating Point Operations (FLOPs), Agentic AI faces compounding computational and communication energy costs. In this survey, we propose an energy accounting framework identifying computational and communication costs across the Perception-Reasoning-Action cycle. We establish a unified taxonomy spanning model simplification, computation control, input and attention optimization, and hardware-aware inference. We explore cross-layer co-design strategies jointly optimizing model parameters, wireless transmissions, and edge resources. Finally, we identify open challenges of federated green learning, carbon-aware agency, 6th generation mobile communication (6G)-native Agentic AI, and self-sustaining systems, providing a roadmap for scalable autonomous intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. This survey proposes an energy accounting framework that identifies both computational and communication energy costs across the Perception-Reasoning-Action cycle in Agentic AI. It introduces a unified taxonomy organized into four areas—model simplification, computation control, input and attention optimization, and hardware-aware inference—while examining cross-layer co-design strategies that jointly optimize model parameters, wireless transmissions, and edge resources. The paper concludes by outlining open challenges in federated green learning, carbon-aware agency, 6G-native Agentic AI, and self-sustaining systems.

Significance. If the literature synthesis is comprehensive, the framework and taxonomy provide a timely organizing structure for the emerging intersection of Agentic AI and networked systems, where communication energy costs compound iterative inference. The cross-layer perspective and explicit roadmap of open challenges could help guide research in mobile edge computing and wireless AI, areas where pure compute-centric optimizations are insufficient.

minor comments (2)
  1. [Taxonomy section] The taxonomy is introduced as spanning four categories, but without an accompanying summary table or diagram that maps representative techniques and cited works to each category, readers may find it difficult to quickly assess coverage and relationships between areas.
  2. [Energy accounting framework] The energy accounting framework description would benefit from a concrete example or pseudocode illustrating how computational FLOPs and communication bits are combined into a total energy metric for a sample Perception-Reasoning-Action loop; a hedged sketch of what such an example could look like follows below.
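
For minor comment 2, a minimal sketch of such an example might look like the following. Every coefficient is a hypothetical placeholder rather than a figure from the paper, and the paper's actual accounting may decompose costs differently:

```python
# Minimal sketch: combine compute (FLOPs) and communication (bits) into one
# energy total for a single Perception-Reasoning-Action loop.
J_PER_FLOP = 1e-11       # assumed effective energy per FLOP on the edge device
J_PER_BIT_TX = 1e-7      # assumed uplink radio energy per transmitted bit
J_PER_BIT_RX = 2e-8      # assumed downlink radio energy per received bit

def loop_energy(perception_flops, reasoning_flops, action_flops,
                uplink_bits, downlink_bits):
    e_comp = (perception_flops + reasoning_flops + action_flops) * J_PER_FLOP
    e_comm = uplink_bits * J_PER_BIT_TX + downlink_bits * J_PER_BIT_RX
    return e_comp, e_comm, e_comp + e_comm

# One loop: encode sensor frames, run an on-device LLM planner, emit control
# commands, and exchange state with an edge server (all magnitudes invented).
e_comp, e_comm, total = loop_energy(
    perception_flops=5e10, reasoning_flops=2.56e11, action_flops=1e9,
    uplink_bits=4e6, downlink_bits=5e5,
)
print(f"compute {e_comp:.2f} J + communication {e_comm:.2f} J = {total:.2f} J")
```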

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript's significance and for recommending minor revision. The summary accurately captures the core contributions of our energy accounting framework, taxonomy, and cross-layer co-design perspective for Agentic AI inference.

Circularity Check

0 steps flagged

No significant circularity in survey synthesis

Full rationale

This is a survey paper proposing an energy accounting framework and unified taxonomy for Agentic AI by synthesizing cited literature across model simplification, computation control, input/attention optimization, and hardware-aware inference, plus cross-layer co-design. No mathematical derivations, equations, fitted parameters, predictions, or uniqueness theorems are asserted that could reduce by construction to the paper's own inputs or self-citations. The central claims are a descriptive organization of external work, leaving the argument self-contained with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey, the paper's central contributions rest on the authors' synthesis of prior literature on AI energy optimization and wireless networking. No free parameters, axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5506 in / 1256 out tokens · 56136 ms · 2026-05-10T17:33:00.840596+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

148 extracted references · 73 canonical work pages · 1 internal anchor

  1. [1]

    Rishabh Agrawal, Himanshu Kumar, and Shashikant Reddy Lnu. 2025. Efficient LLMs for edge devices: Pruning, quantization, and distillation techniques. In 2025 International Conference on Machine Learning and Autonomous Systems (ICMLAS). IEEE, 1413–1418

  2. [2]

    Jin-Hyun Ahn, Osvaldo Simeone, and Joonhyuk Kang. 2020. Wireless Federated Distillation for Distributed Edge Learning with Heterogeneous Data. IEEE Transactions on Wireless Communications 19, 11 (2020), 7130–7144

  3. [3]

    Mo Ahtasam. 2025. DOL-LLM: Optimizing Large Language Model Inference with Domain-Specific Adaptations and Efficiency Techniques via Quantization, Pruning, and Distillation. Authorea Preprints (2025)

  4. [4]

    Keivan Alizadeh, Iman Mirzadeh, Hooman Shahrokhi, Dmitry Belenko, Chenfan Sun, Minsik Cho, Mohammad Sekhavat, Moin Nabi, and Mehrdad Farajtabar. 2024. Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models

  5. [5]

    Keivan Alizadeh, Seyed Iman Mirzadeh, Dmitry Belenko, S Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar. 2024. LLM in a flash: Efficient large language model inference with limited memory. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 12562–12584

  6. [6]

    Yongqi An, Xu Zhao, Tao Yu, Ming Tang, and Jinqiao Wang. 2024. Fluctuation-Based Adaptive Structured Pruning for Large Language Models. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)

  7. [7]

    Sudharshana B, Nandhini V, and AkilaGandhi G S ME. 2025. A Comprehensive Review of LLM Neural Network Enhancements for Advanced Driving Assistance Systems Through Quantization. In 2025 International Conference on Advancements in Smart, Secure and Intelligent Computing (ASSIC). 1–7. doi:10.1109/ASSIC64892.2025.11158258

  8. [8]

    Hankyul Baek, Gyu Seon Kim, Soohyun Park, Andreas F. Molisch, and Joongheon Kim. 2025. Slimmable Federated Reinforcement Learning for Energy-Efficient Proactive Caching. IEEE Transactions on Networking 33, 4 (2025), 2079–2094. doi:10.1109/TON.2025.3554608

  9. [9]

    Tong Bai, Bohan Huang, Zichuan Xu, Bo Hou, Haoran Zhao, and Zhipeng Wang. 2025. Adaptive Feature Compression and Resource Scheduling for End-Edge Co-Inference. IEEE Internet of Things Journal 12, 18 (2025), 37255–37270. doi:10.1109/JIOT.2025.3582220

  10. [10]

    Krishna Bajpai and Vedanshi Gupta. 2025. EcoLLM: A Joint Optimization Framework for Ultra-Low Power, Mixed-Precision LLM Inference on Resource-Constrained Edge Systems. Authorea Preprints (2025)

  11. [11]

    Pedram Bakhtiariifard, Christian Igel, and Raghavendra Selvan. 2024. EC-NAS: Energy Consumption Aware Tabular Benchmarks for Neural Architecture Search. In ICASSP 2024–2024 IEEE International Conference on Acoustics, Speech and Signal Processing. 5660–5664

  12. [12]

    Rui Bao, Nan Xue, Yaping Sun, and Zhiyong Chen. 2025. Dynamic Quality-Latency Aware Routing for LLM Inference in Wireless Edge-Device Networks. In 2025 IEEE/CIC International Conference on Communications in China (ICCC Workshops). IEEE. doi:10.1109/ICCCWorkshops67136.2025.11147210

  13. [13]

    S Bhardwaj, P Singh, and M K Pandit. 2024. A Survey on the Integration and Optimization of Large Language Models in Edge Computing Environments. In 2024 16th International Conference on Computer and Automation Engineering (ICCAE). 168–172

  14. [14]

    Parag Biswas, Abdur Rashid, Angona Biswas, Md Abdullah Al Nasim, Sovon Chakraborty, Kishor Datta Gupta, and Roy George. 2024. AI-driven approaches for optimizing power consumption: A comprehensive survey. Discover Artificial Intelligence 4, 116 (2024)

  15. [15]

    Marcello Bullo, Seifallah Jardak, Pietro Carnelli, and Deniz Gündüz. 2024. Energy-Aware Dynamic Neural Inference. arXiv preprint arXiv:2411.02471 (2024)

  16. [16]

    Mohak Chadha, Thandayuthapani Subramanian, Eishi Arima, Michael Gerndt, Martin Schulz, and Osama Abboud. Greencourier: Carbon-aware scheduling for serverless functions. In Proceedings of the 9th International Workshop on Serverless Computing. 18–23

  18. [18]

    Xiaojing Chen, Si Chen, Wei Ni, Xin Wang, Sihai Zhang, Shunqing Zhang, Yanzan Sun, Shugong Xu, and Abbas Jamalipour. 2024. Optimal Two-Timescale Configuration of Mobile Edge Computing With Mixed Energy Supply. IEEE Transactions on Smart Grid 15, 5 (2024), 4765–4778. doi:10.1109/TSG.2024.3390772

  19. [19]

    Xiaojing Chen, Zhuoxiao Chen, Wei Ni, Zhenxu Bai, and Shunqing Zhang. 2024. Joint User Association and Resource Allocation for Smart-Grid-Powered Wireless Networks Under Constrained Carbon Emission. IEEE Wireless Communications Letters 13, 11 (2024), 3217–3221. doi:10.1109/LWC.2024.3459010

  20. [20]

    Xiaojing Chen, Yijun Ding, Wei Ni, Xin Wang, Yichuang Sun, and Shunqing Zhang. 2025. Towards Dynamic Energy/Carbon Trading and Resource Allocation for MEC: A Two-Timescale Deep Reinforcement Learning Approach. In 2025 IEEE/CIC International Conference on Communications in China (ICCC). 1–6. doi:10.1109/ICCC65529.2025.11148917

  21. [21]

    Xiaojing Chen, Zhenyuan Li, Wei Ni, Xin Wang, Shunqing Zhang, Yanzan Sun, Shugong Xu, and Qingqi Pei. 2024. Toward Dynamic Resource Allocation and Client Scheduling in Hierarchical Federated Learning: A Two-Phase Deep Reinforcement Learning Approach. IEEE Trans. Commun. 72, 12 (2024), 7798–7813. doi:10.1109/TCOMM.2024.3420733

  22. [22]

    Xiaojing Chen, Hanfei Wen, Wei Ni, Shunqing Zhang, Xin Wang, Shugong Xu, and Qingqi Pei. 2022. Distributed Online Optimization of Edge Computing With Mixed Power Supply of Renewable Energy and Smart Grid. IEEE Transactions on Communications 70, 1 (2022), 389–403. doi:10.1109/TCOMM.2021.3123275

  23. [23]

    Yuxuan Chen, Rongpeng Li, Xiaoxue Yu, Zhifeng Zhao, and Honggang Zhang. 2025. Adaptive layer splitting for wireless large language model inference in edge computing: A model-based reinforcement learning approach. Frontiers of Information Technology & Electronic Engineering 26, 2 (2025), 278–292. doi:10.1631/FITEE.2400468

  24. [24]

    Han Cho, Apurba Prasad Padhy, Fernando Camacho, and Saibal Mukhopadhyay. 2025. Sub 4-bit Power-of-Two Based Mixed-Precision Quantization for Efficient LLM Compression and Acceleration. IEEE Access (2025), 1–1. doi:10.1109/ACCESS.2025.3625771

  25. [25]

    Xuan-Toan Dang, Binh-Minh Vu, Quynh-Suong Nguyen, Thi-Thuy-Minh Tran, Joon-Soo Eom, and Oh-Soon Shin. 2024. A Survey on Energy-Efficient Design for Federated Learning over Wireless Networks. Energies 17, 24 (2024). doi:10.3390/en17246485

  27. [27]

    Pierre V. Dantas, Lucas C. Cordeiro, and Waldir S. S. Junior. 2025. A review of state-of-the-art techniques for large language model compression. Complex & Intelligent Systems 11, 407 (2025)

  28. [28]

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS)

  29. [29]

    Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah, and Subhabrata Mukherjee. 2023. SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference. arXiv preprint arXiv:2307.02628 (2023)

  30. [30]

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient Finetuning of Quantized Large Language Models. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS)

  31. [31]

    Hongyang Du, Zehui Li, Dusit Niyato, Jiawen Kang, Zehui Xiong, Xuemin Shen, and Dong In Kim. 2024. Enabling AI-Generated Content Services in Wireless Edge Networks. IEEE Wireless Communications 31, 3 (2024), 226–234

  32. [32]

    Kiannah Foster, Andrew Johansson, Elizabeth Williams, Daniel Petrovic, and Nicholas Kovalenko. 2024. A Token-Agnostic Approach to Controlling Generated Text Length in Large Language Models. Research Square (2024). doi:10.21203/rs.3.rs-5204102/v1. Preprint

  33. [33]

    Elias Frantar and Dan Alistarh. 2023. SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot. In Proceedings of the 40th International Conference on Machine Learning (ICML). PMLR

  34. [34]

    Elias Frantar, Roberto L. Castro, Jiale Chen, Torsten Hoefler, and Dan Alistarh. 2025. MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models. In Proceedings of the 30th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’25). doi:10.1145/3710848.3710871

  35. [35]

    Shangqian Gao, Chi-Heng Lin, Ting Hua, Tang Zheng, Yilin Shen, Hongxia Jin, and Yen-Chang Hsu. 2024. DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models. In Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS)

  36. [36]

    Jay Gorvadiya, Ankur Chagela, and Mohendra Roy. 2025. Energy efficient pruning and quantization methods for deep learning models. In 2025 International Conference on Sustainable Energy Technologies and Computational Intelligence (SETCOM). IEEE, 1–6

  37. [37]

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. 2024. MiniLLM: Knowledge Distillation of Large Language Models. arXiv preprint arXiv:2306.08543 (2024)

  38. [38]

    Shuaishuai Guo, Yanhu Wang, Shujing Li, and Nasir Saeed. 2023. Semantic Importance-Aware Communications Using Pre-Trained Language Models. IEEE Communications Letters 27, 9 (2023), 2328–2332. doi:10.1109/LCOMM.2023.3293805

  39. [39]

    Sama Habibi and Ozgur Ercetin. 2025. Edge-LLM Inference With Cost-Aware Layer Allocation and Adaptive Scheduling. IEEE Access 13 (2025), 131614–131637. doi:10.1109/ACCESS.2025.3592308

  40. [40]

    Siem Hadish, Maher Guizani, Moayad Aloqaily, and Latif U. Khan. 2025. Transformer Based Architecture for Smart Grid Energy Consumption Forecasting. In 2025 International Wireless Communications and Mobile Computing (IWCMC). 1726–1731. doi:10.1109/IWCMC65282.2025.11059615

  41. [41]

    Abdul Hannan, Daniele Falavigna, and Alessio Brutti. 2025. Input Conditioned Layer Dropping in Speech Foundation Models. In 2025 IEEE 35th International Workshop on Machine Learning for Signal Processing (MLSP). 1–6. doi:10.1109/MLSP62443.2025.11204255

  42. [42]

    Zixu Hao, Huiqiang Jiang, Shiqi Jiang, Ju Ren, and Ting Cao. 2024. Hybrid SLM and LLM for Edge-Cloud Collaborative Inference. In Proceedings of the Workshop on Edge and Mobile Foundation Models and Workshop on Mobile Computing with Large Language Models (EdgeFM ’24). ACM, 1–6. doi:10.1145/3662006.3662067

  43. [43]

    Zeqi Hao, Guoqing Xu, Yun Luo, Heng Hu, Jianping An, and Shiwen Mao. 2023. Multi-Agent Collaborative Inference via DNN Decoupling: Intermediate Feature Compression and Edge Learning. IEEE Transactions on Mobile Computing 22, 10 (2023), 6041–6055

  44. [44]

    Ying He, Jingcheng Fang, F. Richard Yu, and Victor C. Leung. 2024. Large Language Models (LLMs) Inference Offloading and Resource Allocation in Cloud-Edge Computing: An Active Inference Approach. IEEE Transactions on Mobile Computing 23, 12 (2024), 11253–11266. doi:10.1109/TMC.2024.3415661

  45. [45]

    John L. Hennessy and David A. Patterson. 2017. Computer Architecture: A Quantitative Approach (6 ed.). Morgan Kaufmann, Cambridge, MA, USA

  46. [46]

    Miao Hu, Qi He, and Di Wu. 2025. QLLMS: Quantization-Adaptive LLM Scheduling for Partially Informed Edge Serving Systems. In Proceedings of IEEE INFOCOM. doi:10.1109/INFOCOM55648.2025.11044591

  47. [47]

    Shuyan Hu, Xiaojing Chen, Wei Ni, Xin Wang, and Ekram Hossain. 2020. Modeling and Analysis of Energy Harvesting and Smart Grid-Powered Wireless Communication Networks: A Contemporary Survey. IEEE Transactions on Green Communications and Networking 4, 2 (2020), 461–496. doi:10.1109/TGCN.2020.2988270

  48. [48]

    E J Husom, A Goknil, M Astekin, L K Shar, A Kåsen, S Sen, B A Mithassel, and A Soylu. 2025. Sustainable LLM Inference for Edge AI: Evaluating Quantized LLMs for Energy Efficiency, Output Accuracy, and Inference Latency. ACM Transactions on Internet of Things (2025)

  49. [49]

    Kunal Jain, Anjaly Parayil, Ankur Mallick, Esha Choukse, Xiaoting Qin, Jue Zhang, Íñigo Goiri, Rujia Wang, Chetan Bansal, Victor Rühle, Anoop Kulkarni, Steve Kofsky, and Saravan Rajmohan. 2025. Performance Aware LLM Load Balancer for Mixed Workloads. In Proceedings of the 5th Workshop on Machine Learning and Systems (EuroMLSys ’25). ACM, 19–30. doi:10.1145...

  50. [50]

    Metod Jazbec, Patrick Forré, Stephan Mandt, Dan Zhang, and Eric Nalisnick. 2024. Early-Exit Neural Networks with Nested Prediction Sets. arXiv preprint arXiv:2311.05931 (2024). Accepted at the 40th Conference on Uncertainty in Artificial Intelligence (UAI 2024)

  51. [51]

    Chaoyi Jiang, Lei Gao, Hossein Entezari Zarch, and Murali Annavaram. 2025. KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation. In Findings of the Association for Computational Linguistics: ACL 2025. Association for Computational Linguistics, 19474–19488. https://github.com/chaoyij/KVPR

  52. [52]

    Huiqiang Jiang, Qianhui Wu, Chin-Yew Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2023. LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). 13358–13376

  53. [53]

    Yikun Jiang, Huanyu Wang, Lei Xie, Hanbin Zhao, Chao Zhang, Hui Qian, and John C.S. Lui. 2024. D-LLM: A Token Adaptive Computing Resource Allocation Strategy for Large Language Models. In Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS)

  54. [54]

    Hongpeng Jin and Yanzhao Wu. 2025. CE-CoLLM: Efficient and Adaptive Large Language Models Through Cloud-Edge Collaboration. In 2025 IEEE International Conference on Web Services (ICWS). 316–323. doi:10.1109/ICWS67624.2025.00046

  55. [55]

    Andreas Kosmas Kakolyris, Dimosthenis Masouros, Petros Vavaroutsos, Sotirios Xydis, and Dimitrios Soudris. 2025. ThrottLL’eM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 1363–1378

  56. [56]

    Andreas Kosmas Kakolyris, Dimosthenis Masouros, Sotirios Xydis, and Dimitrios Soudris. 2024. SLO-Aware GPU DVFS for Energy-Efficient LLM Inference Serving. IEEE Computer Architecture Letters 23, 2 (2024), 150–153. doi:10.1109/LCA.2024.3406038

  57. [57]

    Christopher Keith, Michael Robinson, Francis Duncan, Allan Worthington, Joseph Wilson, and Sofia Harris. 2024. Optimizing Large Language Models: A Novel Approach Through Dynamic Token Pruning. Research Square (2024). doi:10.21203/rs.3.rs-5293588/v1

  58. [58]

    Rui Kong, Yuanchun Li, Qingtian Feng, Weijun Wang, Xiaozhou Ye, Ye Ouyang, Linghe Kong, and Yunxin Liu. 2024. SwapMoE: Serving off-the-shelf MoE-based large language models with tunable memory budget. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 6710–6720

  59. [59]

    Sravani Kurma, Anal Paul, Keshav Singh, Kapal Dev, and Chih-Peng Li. 2025. LLMs for Resource Allocation in Next-Gen RIS-Aided Healthcare Wireless Networks. In IEEE INFOCOM 2025 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). 1–6. doi:10.1109/INFOCOMWKSHPS65812.2025.11152831

  60. [60]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP ’23). ACM, 611–626. doi:10.1145/3600006.3613165

  61. [61]

    Lei Lei, Yaxiong Yuan, Yu Zhou, Yang Yang, Yu Luo, Lina Pu, and Symeon Chatzinotas. 2024. Energy Optimization and Lightweight Design for Efficient Federated Learning in Wireless Edge Systems. IEEE Transactions on Vehicular Technology 73, 9 (2024), 13542–13557

  62. [62]

    Hanxi Li, Guorong Chen, Bin Wang, Zheng Chen, Yongsheng Zhu, Fuqiang Hu, Jiao Dai, and Wei Wang. 2025. PFedKD: Personalized Federated Learning via Knowledge Distillation Using Unlabeled Pseudo Data for Internet of Things. IEEE Internet of Things Journal 12, 11 (June 2025), 16314–16327. doi:10.1109/JIOT.2025.3533003

  63. [63]

    Jinrong Li, Biao Han, Sudan Li, Xiaoyan Wang, and Jie Li. 2024. CoLLM: A Collaborative LLM Inference Framework for Resource-Constrained Devices. In Proceedings of the IEEE/CIC International Conference on Communications in China (ICCC). doi:10.1109/ICCC62479.2024.10681712

  64. [64]

    Shiyao Li, Xuefei Ning, Ke Hong, Tengxuan Liu, Luning Wang, Xiuhong Li, Kai Zhong, Guohao Dai, Huazhong Yang, and Yu Wang. 2023. LLM-MQ: Mixed-precision Quantization for Efficient LLM Deployment. In NeurIPS 2023 Workshop on Efficient Natural Language and Speech Processing

  65. [65]

    Yuhang Li, Rong Gu, Chengying Huan, Zhibin Wang, Renjie Yao, Chen Tian, and Guihai Chen. 2025. HotPrefix: Hotness-Aware KV Cache Scheduling for Efficient Prefix Sharing in LLM Inference Systems. Proceedings of the ACM on Management of Data (SIGMOD) 3, 4 (2025), Article 250. doi:10.1145/3749168

  66. [66]

    Zonghang Li, Wenjiao Feng, Mohsen Guizani, and Hongfang Yu. 2025. TPI-LLM: Serving 70B-Scale LLMs Efficiently on Low-Resource Mobile Devices. IEEE Transactions on Services Computing 18, 5 (2025), 3321–3333. doi:10.1109/TSC.2025.3596892

  67. [67]

    Chengsi Liang, Hongyang Du, Yao Sun, Dusit Niyato, Jiawen Kang, Dezong Zhao, and Muhammad Ali Imran. 2025. Generative AI-Driven Semantic Communication Networks: Architecture, Technologies, and Applications. IEEE Transactions on Cognitive Communications and Networking 11, 1 (2025), 27–47. doi:10.1109/TCCN.2024.3435524

  68. [68]

    Gui Ling, Ziyang Wang, Yuliang Yan, and Qingwen Liu. 2024. SlimGPT: Layer-wise Structured Pruning for Large Language Models. In Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS)

  69. [69]

    Dong Liu and Yanxuan Yu. 2025. TinyServe: Query-Aware Cache Selection for Efficient LLM Serving. In Proceedings of the 33rd ACM International Conference on Multimedia (MM ’25). ACM, 12528–12536. doi:10.1145/3746027.3758181

  70. [70]

    Jiacheng Liu, Peng Tang, Wenfeng Wang, Yuhang Ren, Xiaofeng Hou, Pheng-Ann Heng, Minyi Guo, and Chao Li. 2024. A Survey on Inference Optimization Techniques for Mixture of Experts Models. arXiv preprint arXiv:2412.14219 (2024)

  72. [72]

    Shu Liu, Dingzhu Wen, Da Li, Qimei Chen, Guangxu Zhu, and Yuanming Shi. 2024. Energy-Efficient Optimal Mode Selection for Edge AI Inference via Integrated Sensing-Communication-Computation. IEEE Transactions on Mobile Computing 23, 12 (2024), 14248–14262. doi:10.1109/TMC.2024.3440581

  73. [73]

    Sicong Liu, Weiye Wu, Xiangrui Xu, Teng Li, Bowen Pang, Bin Guo, and Zhiwen Yu. 2025. Adaptive and Resource-efficient Agentic AI Systems for Mobile and Embedded Devices: A Survey. arXiv:2510.00078 [cs.LG] https://arxiv.org/abs/2510.00078

  74. [74]

    Yuxuan Liu. 2024. Learning to Reason with Autoregressive In-Context Distillation. In Proceedings of the International Conference on Learning Representations (ICLR), Tiny Papers Track

  75. [75]

    Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. 2024. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving. In Proceedings of the ACM SIGCOMM 2024 Conference. ACM, 1–18. doi:10....

  76. [76]

    Zhang Liu, Hongyang Du, Lianfen Huang, Zhibin Gao, and Dusit Niyato. 2025. Joint Model Caching and Resource Allocation in Generative AI-Enabled Wireless Edge Networks. In 2025 IEEE Wireless Communications and Networking Conference (WCNC). 1–6. doi:10.1109/WCNC61545.2025.10978225

  77. [77]

    Zhihao Liu, Xianliang Yang, Zichuan Liu, Yifan Xia, Wei Jiang, Yuanyu Zhang, Lijuan Li, Guoliang Fan, Lei Song, and Bian Jiang. 2024. Knowing what not to do: Leverage language model insights for action space pruning in multi-agent reinforcement learning. arXiv preprint arXiv:2405.16854 (2024)

  78. [78]

    Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2024. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. In International Conference on Machine Learning (ICML)

  79. [79]

    Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, and Vikas Chandra. 2024. MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases. In International Conference on Machine Learning (ICML)

  80. [80]

    Jianlan Luo, Perry Dong, Jeffrey Wu, Aviral Kumar, Xinyang Geng, and Sergey Levine. 2023. Action-Quantized Offline Reinforcement Learning for Robotic Skill Learning. In Proceedings of the 7th Conference on Robot Learning (CoRL). Atlanta, USA. https://saqrl.github.io

Showing first 80 references.