Recognition: 2 theorem links
Networking-Aware Energy Efficiency in Agentic AI Inference: A Survey
Pith reviewed 2026-05-10 17:33 UTC · model grok-4.3
The pith
Agentic AI inference faces compounding computational and communication energy costs; an energy accounting framework and unified taxonomy can organize these costs to enable cross-layer optimizations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In this survey, we propose an energy accounting framework identifying computational and communication costs across the Perception-Reasoning-Action cycle. We establish a unified taxonomy spanning model simplification, computation control, input and attention optimization, and hardware-aware inference. We explore cross-layer co-design strategies jointly optimizing model parameters, wireless transmissions, and edge resources.
What carries the argument
Energy accounting framework that tracks computational and communication costs through the Perception-Reasoning-Action cycle of Agentic AI, organized by a unified taxonomy of optimization techniques.
If this is right
- Cross-layer co-design enables joint optimization of model parameters, wireless transmissions, and edge resources to lower total energy use in mobile edge computing and autonomous systems.
- The taxonomy organizes existing techniques so that model simplification and hardware-aware inference can be applied together with input optimization for iterative loops.
- Identification of open challenges in federated green learning and carbon-aware agency points toward future directions for self-sustaining Agentic AI.
- The framework supports development of 6G-native Agentic AI by linking inference energy accounting directly to networking constraints.
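The cross-layer co-design claim can be made concrete with a small sketch: jointly choose a quantization bit-width and an offload fraction so that compute energy and wireless transmission energy are minimized together rather than separately. All constants below (per-MAC and per-bit energies, the accuracy proxy, the search grid) are hypothetical stand-ins, not values from the survey.

```python
from itertools import product

# Illustrative energy coefficients (hypothetical, not from the paper).
E_PER_MAC_PJ = {4: 0.1, 8: 0.2, 16: 0.5}  # pJ per multiply-accumulate at each bit-width
E_PER_BIT_NJ = 5.0                        # nJ per bit sent over the wireless link

def total_energy_j(params_m, bits, tx_fraction):
    """Energy of one inference step: on-device compute plus uplink transmission.

    params_m    : model size in millions of parameters
    bits        : quantization bit-width (4, 8, or 16)
    tx_fraction : fraction of the model's activations offloaded over the link
    """
    macs = params_m * 1e6                                # ~1 MAC per parameter
    e_compute = macs * E_PER_MAC_PJ[bits] * 1e-12        # pJ -> J
    payload_bits = params_m * 1e6 * tx_fraction * bits   # offloaded payload size
    e_comm = payload_bits * E_PER_BIT_NJ * 1e-9          # nJ -> J
    return e_compute + e_comm

def co_design(params_m, accuracy_floor):
    """Grid-search (bit-width, offload fraction) for the lowest joint energy
    whose crude accuracy proxy stays above a floor."""
    best = None
    for bits, tx in product((4, 8, 16), (0.0, 0.25, 0.5)):
        acc = 1.0 - 0.02 * (16 - bits) / 4  # hypothetical accuracy penalty for quantization
        if acc < accuracy_floor:
            continue
        e = total_energy_j(params_m, bits, tx)
        if best is None or e < best[0]:
            best = (e, bits, tx)
    return best
```

The point of the sketch is only the coupling: tightening the accuracy floor forces a wider bit-width, which in turn changes both the compute term and the size of any offloaded payload, so neither knob can be optimized in isolation.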
Where Pith is reading between the lines
- The same accounting approach could be tested on non-LLM agentic pipelines to check whether communication costs remain dominant when perception involves sensor fusion rather than text.
- Extending the framework to include carbon-intensity signals from the grid could turn the taxonomy into a tool for carbon-aware scheduling of inference tasks.
- Applying the cross-layer ideas to multi-agent swarms might reveal new coordination overheads not captured in single-agent surveys.
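The carbon-aware-scheduling extension mentioned above reduces, in its simplest form, to shifting deferrable inference work into low-carbon-intensity windows. A minimal sketch, assuming a per-hour grid carbon-intensity forecast is available (the forecast values here are invented):

```python
def pick_greenest_slot(carbon_forecast, task_energy_kwh):
    """Pick the hour with the lowest grid carbon intensity (gCO2/kWh) and
    return (hour index, implied emissions in grams) for running the task there."""
    hour, intensity = min(enumerate(carbon_forecast), key=lambda kv: kv[1])
    return hour, intensity * task_energy_kwh

# Hypothetical hourly forecast in gCO2/kWh for a 2 kWh inference batch.
forecast = [450, 380, 120, 90, 300]
hour, grams = pick_greenest_slot(forecast, task_energy_kwh=2.0)
```

In an agentic loop the same lookup would gate only the deferrable phases (e.g. batch re-planning), since perception and action are typically latency-bound.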
Load-bearing premise
Computational and communication energy costs dominate in Agentic AI, and a single unified taxonomy plus cross-layer co-design can organize and solve the problem without missing major trade-offs or requiring entirely new categories.
What would settle it
A concrete demonstration that a substantial energy cost or trade-off in Agentic AI inference falls outside the four taxonomy categories or cannot be addressed by the surveyed cross-layer co-design approaches.
Original abstract
The rapid emergence of Large Language Models (LLMs) has catalyzed Agentic artificial intelligence (AI), autonomous systems integrating perception, reasoning, and action into closed-loop pipelines for continuous adaptation. While unlocking transformative applications in mobile edge computing, autonomous systems, and next-generation wireless networks, this paradigm creates fundamental energy challenges through iterative inference and persistent data exchange. Unlike traditional AI where bottlenecks are computational Floating Point Operations (FLOPs), Agentic AI faces compounding computational and communication energy costs. In this survey, we propose an energy accounting framework identifying computational and communication costs across the Perception-Reasoning-Action cycle. We establish a unified taxonomy spanning model simplification, computation control, input and attention optimization, and hardware-aware inference. We explore cross-layer co-design strategies jointly optimizing model parameters, wireless transmissions, and edge resources. Finally, we identify open challenges of federated green learning, carbon-aware agency, 6th generation mobile communication (6G)-native Agentic AI, and self-sustaining systems, providing a roadmap for scalable autonomous intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This survey proposes an energy accounting framework that identifies both computational and communication energy costs across the Perception-Reasoning-Action cycle in Agentic AI. It introduces a unified taxonomy organized into four areas—model simplification, computation control, input and attention optimization, and hardware-aware inference—while examining cross-layer co-design strategies that jointly optimize model parameters, wireless transmissions, and edge resources. The paper concludes by outlining open challenges in federated green learning, carbon-aware agency, 6G-native Agentic AI, and self-sustaining systems.
Significance. If the literature synthesis is comprehensive, the framework and taxonomy provide a timely organizing structure for the emerging intersection of Agentic AI and networked systems, where communication energy costs compound iterative inference. The cross-layer perspective and explicit roadmap of open challenges could help guide research in mobile edge computing and wireless AI, areas where pure compute-centric optimizations are insufficient.
Minor comments (2)
- [Taxonomy section] The taxonomy is introduced as spanning four categories, but without an accompanying summary table or diagram that maps representative techniques and cited works to each category, readers may find it difficult to quickly assess coverage and relationships between areas.
- [Energy accounting framework] The energy accounting framework description would benefit from a concrete example or pseudocode illustrating how computational FLOPs and communication bits are combined into a total energy metric for a sample Perception-Reasoning-Action loop.
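A minimal sketch of the kind of accounting the second comment asks for: each Perception-Reasoning-Action phase contributes a compute term (FLOPs) and a communication term (bits), and the two are combined with device-level energy coefficients. Every number below is a hypothetical placeholder; the survey itself gives the framework only qualitatively.

```python
# Hypothetical effective energy coefficients for an edge device.
J_PER_FLOP = 1e-12  # joules per floating-point operation
J_PER_BIT = 1e-8    # joules per bit over the wireless link

def phase_energy(flops, bits):
    """Energy of one phase = compute term + communication term."""
    return flops * J_PER_FLOP + bits * J_PER_BIT

def cycle_energy(phases):
    """Per-phase energies over one Perception-Reasoning-Action loop.
    phases: dict of phase name -> (flops, bits exchanged)."""
    return {name: phase_energy(f, b) for name, (f, b) in phases.items()}

# Invented workload figures for a single loop iteration.
loop = {
    "perception": (2e9, 8e6),   # encode sensor input, upload features
    "reasoning":  (5e10, 1e5),  # LLM decoding, small control messages
    "action":     (1e8, 2e6),   # actuation planning, command downlink
}
per_phase = cycle_energy(loop)
total_j = sum(per_phase.values())
```

Even with these toy coefficients the communication term dominates the perception phase, which is the compounding effect the survey's core claim turns on.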
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the manuscript's significance and for recommending minor revision. The summary accurately captures the core contributions of our energy accounting framework, taxonomy, and cross-layer co-design perspective for Agentic AI inference.
Circularity Check
No significant circularity in survey synthesis
Full rationale
This is a survey paper proposing an energy accounting framework and unified taxonomy for Agentic AI by synthesizing cited literature across model simplification, computation control, input/attention optimization, and hardware-aware inference, plus cross-layer co-design. No mathematical derivations, equations, fitted parameters, predictions, or uniqueness theorems are asserted that could reduce by construction to the paper's own inputs or self-citations. The central claims are descriptive organization of external work, rendering the argument self-contained with no load-bearing circular steps.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
We propose an energy accounting framework identifying computational and communication costs across the Perception-Reasoning-Action cycle. We establish a unified taxonomy spanning model simplification, computation control, input and attention optimization, and hardware-aware inference.
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
We explore cross-layer co-design strategies jointly optimizing model parameters, wireless transmissions, and edge resources.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Rishabh Agrawal, Himanshu Kumar, and Shashikant Reddy Lnu. 2025. Efficient LLMs for edge devices: Pruning, quantization, and distillation techniques. In2025 International Conference on Machine Learning and Autonomous Systems (ICMLAS). IEEE, 1413–1418
2025
-
[2]
Jin-Hyun Ahn, Osvaldo Simeone, and Joonhyuk Kang. 2020. Wireless Federated Distillation for Distributed Edge Learning with Heterogeneous Data.IEEE Transactions on Wireless Communications19, 11 (2020), 7130–7144
2020
-
[3]
Mo Ahtasam. 2025. DOL-LLM-Optimizing Large Language Model Inference with Domain-Specific Adaptations and Efficiency Techniques via Quantization, Pruning, and Distillation.Authorea Preprints(2025)
2025
-
[4]
Keivan Alizadeh, Iman Mirzadeh, Hooman Shahrokhi, Dmitry Belenko, Chenfan Sun, Minsik Cho, Mohammad Sekhavat, Moin Nabi, and Mehrdad Farajtabar. 2024. Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models
2024
-
[5]
Keivan Alizadeh, Seyed Iman Mirzadeh, Dmitry Belenko, S Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar. 2024. LLM in a flash: Efficient large language model inference with limited memory. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 12562–12584
2024
-
[6]
Yongqi An, Xu Zhao, Tao Yu, Ming Tang, and Jinqiao Wang. 2024. Fluctuation-Based Adaptive Structured Pruning for Large Language Models. InProceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)
2024
-
[7]
Sudharshana B, Nandhini V, and AkilaGandhi G S ME. 2025. A Comprehensive Review of LLM Neural Network Enhancements for Advanced Driving Assistance Systems Through Quantization. In2025 International Conference on Advancements in Smart, Secure and Intelligent Computing (ASSIC). 1–7. doi:10.1109/ASSIC64892.2025.11158258
-
[8]
Hankyul Baek, Gyu Seon Kim, Soohyun Park, Andreas F. Molisch, and Joongheon Kim. 2025. Slimmable Federated Reinforcement Learning for Energy-Efficient Proactive Caching.IEEE Transactions on Networking33, 4 (2025), 2079–2094. doi:10.1109/TON.2025.3554608
-
[9]
Tong Bai, Bohan Huang, Zichuan Xu, Bo Hou, Haoran Zhao, and Zhipeng Wang. 2025. Adaptive Feature Compression and Resource Scheduling for End-Edge Co-Inference.IEEE Internet of Things Journal12, 18 (2025), 37255–37270. doi:10.1109/JIOT.2025.3582220
-
[10]
Krishna Bajpai and Vedanshi Gupta. 2025. EcoLLM: A Joint Optimization Framework for Ultra-Low Power, Mixed- Precision LLM Inference on Resource-Constrained Edge Systems.Authorea Preprints(2025)
2025
-
[11]
Pedram Bakhtiariifard, Christian Igel, and Raghavendra Selvan. 2024. EC-NAS: Energy Consumption Aware Tabular Benchmarks for Neural Architecture Search. InICASSP 2024–2024 IEEE International Conference on Acoustics, Speech and Signal Processing. 5660–5664
2024
-
[12]
Rui Bao, Nan Xue, Yaping Sun, and Zhiyong Chen. 2025. Dynamic Quality-Latency Aware Routing for LLM Inference in Wireless Edge-Device Networks. In2025 IEEE/CIC International Conference on Communications in China (ICCC Workshops). IEEE. doi:10.1109/ICCCWorkshops67136.2025.11147210
-
[13]
S Bhardwaj, P Singh, and M K Pandit. 2024. A Survey on the Integration and Optimization of Large Language Models in Edge Computing Environments. In2024 16th International Conference on Computer and Automation Engineering (ICCAE). 168–172
2024
-
[14]
Parag Biswas, Abdur Rashid, Angona Biswas, Md Abdullah Al Nasim, Sovon Chakraborty, Kishor Datta Gupta, and Roy George. 2024. AI-driven approaches for optimizing power consumption: A comprehensive survey.Discover Artificial Intelligence4, 116 (2024)
2024
- [15]
-
[16]
Mohak Chadha, Thandayuthapani Subramanian, Eishi Arima, Michael Gerndt, Martin Schulz, and Osama Abboud
-
[17]
Greencourier: Carbon-aware scheduling for serverless functions. In Proceedings of the 9th International Workshop on Serverless Computing. 18–23
-
[18]
Xiaojing Chen, Si Chen, Wei Ni, Xin Wang, Sihai Zhang, Shunqing Zhang, Yanzan Sun, Shugong Xu, and Abbas Jamalipour. 2024. Optimal Two-Timescale Configuration of Mobile Edge Computing With Mixed Energy Supply. IEEE Transactions on Smart Grid15, 5 (2024), 4765–4778. doi:10.1109/TSG.2024.3390772
-
[19]
Xiaojing Chen, Zhuoxiao Chen, Wei Ni, Zhenxu Bai, and Shunqing Zhang. 2024. Joint User Association and Resource Allocation for Smart-Grid-Powered Wireless Networks Under Constrained Carbon Emission.IEEE Wireless Communications Letters13, 11 (2024), 3217–3221. doi:10.1109/LWC.2024.3459010
-
[20]
Xiaojing Chen, Yijun Ding, Wei Ni, Xin Wang, Yichuang Sun, and Shunqing Zhang. 2025. Towards Dynamic Energy/Carbon Trading and Resource Allocation for MEC: A Two-Timescale Deep Reinforcement Learning Approach. In2025 IEEE/CIC International Conference on Communications in China (ICCC). 1–6. doi:10.1109/ICCC65529.2025. 11148917
-
[21]
Xiaojing Chen, Zhenyuan Li, Wei Ni, Xin Wang, Shunqing Zhang, Yanzan Sun, Shugong Xu, and Qingqi Pei. 2024. Toward Dynamic Resource Allocation and Client Scheduling in Hierarchical Federated Learning: A Two-Phase Deep Reinforcement Learning Approach.IEEE Trans. Commun.72, 12 (2024), 7798–7813. doi:10.1109/TCOMM.2024.3420733
-
[22]
Xiaojing Chen, Hanfei Wen, Wei Ni, Shunqing Zhang, Xin Wang, Shugong Xu, and Qingqi Pei. 2022. Distributed Online Optimization of Edge Computing With Mixed Power Supply of Renewable Energy and Smart Grid.IEEE Transactions on Communications70, 1 (2022), 389–403. doi:10.1109/TCOMM.2021.3123275
-
[23]
Yuxuan Chen, Rongpeng Li, Xiaoxue Yu, Zhifeng Zhao, and Honggang Zhang. 2025. Adaptive layer splitting for wireless large language model inference in edge computing: A model-based reinforcement learning approach.Frontiers of Information Technology & Electronic Engineering26, 2 (2025), 278–292. doi:10.1631/FITEE.2400468
-
[24]
Han Cho, Apurba Prasad Padhy, Fernando Camacho, and Saibal Mukhopadhyay. 2025. Sub 4-bit Power-of-Two Based Mixed-Precision Quantization for Efficient LLM Compression and Acceleration. IEEE Access (2025), 1–1. doi:10.1109/ACCESS.2025.3625771
-
[25]
Xuan-Toan Dang, Binh-Minh Vu, Quynh-Suong Nguyen, Thi-Thuy-Minh Tran, Joon-Soo Eom, and Oh-Soon Shin
-
[26]
A Survey on Energy-Efficient Design for Federated Learning over Wireless Networks. Energies 17, 24 (2024). doi:10.3390/en17246485
-
[27]
Pierre V. Dantas, Lucas C. Cordeiro, and Waldir S. S. Junior. 2025. A review of state-of-the-art techniques for large language model compression. Complex & Intelligent Systems 11, 407 (2025)
2025
-
[28]
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS)
2022
- [29]
-
[30]
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient Finetuning of Quantized Large Language Models. InProceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS)
2023
-
[31]
Hongyang Du, Zehui Li, Dusit Niyato, Jiawen Kang, Zehui Xiong, Xuemin Shen, and Dong In Kim. 2024. Enabling AI-Generated Content Services in Wireless Edge Networks.IEEE Wireless Communications31, 3 (2024), 226–234
2024
-
[32]
Kiannah Foster, Andrew Johansson, Elizabeth Williams, Daniel Petrovic, and Nicholas Kovalenko. 2024. A Token- Agnostic Approach to Controlling Generated Text Length in Large Language Models.Research Square(2024). doi:10.21203/rs.3.rs-5204102/v1 Preprint
-
[33]
Elias Frantar and Dan Alistarh. 2023. SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot. InProceedings of the 40th International Conference on Machine Learning (ICML). PMLR
2023
-
[34]
Elias Frantar, Roberto L. Castro, Jiale Chen, Torsten Hoefler, and Dan Alistarh. 2025. MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models. In Proceedings of the 30th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '25). doi:10.1145/3710848.3710871
-
[35]
Shangqian Gao, Chi-Heng Lin, Ting Hua, Tang Zheng, Yilin Shen, Hongxia Jin, and Yen-Chang Hsu. 2024. DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models. InProceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS)
2024
-
[36]
Jay Gorvadiya, Ankur Chagela, and Mohendra Roy. 2025. Energy efficient pruning and quantization methods for deep learning models. In2025 International Conference on Sustainable Energy Technologies and Computational Intelligence (SETCOM). IEEE, 1–6
2025
-
[37]
Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. 2024. MiniLLM: Knowledge Distillation of Large Language Models. arXiv preprint arXiv:2306.08543(2024)
2024
-
[38]
Shuaishuai Guo, Yanhu Wang, Shujing Li, and Nasir Saeed. 2023. Semantic Importance-Aware Communications Using Pre-Trained Language Models.IEEE Communications Letters27, 9 (2023), 2328–2332. doi:10.1109/LCOMM.2023.3293805
-
[39]
Sama Habibi and Ozgur Ercetin. 2025. Edge-LLM Inference With Cost-Aware Layer Allocation and Adaptive Scheduling.IEEE Access13 (2025), 131614–131637. doi:10.1109/ACCESS.2025.3592308
-
[40]
Siem Hadish, Maher Guizani, Moayad Aloqaily, and Latif U. Khan. 2025. Transformer Based Architecture for Smart Grid Energy Consumption Forecasting. In2025 International Wireless Communications and Mobile Computing (IWCMC). 1726–1731. doi:10.1109/IWCMC65282.2025.11059615
- [41]
-
[42]
Zixu Hao, Huiqiang Jiang, Shiqi Jiang, Ju Ren, and Ting Cao. 2024. Hybrid SLM and LLM for Edge-Cloud Collaborative Inference. InProceedings of the Workshop on Edge and Mobile Foundation Models and Workshop on Mobile Computing with Large Language Models (EdgeFM ’24). ACM, 1–6. doi:10.1145/3662006.3662067
-
[43]
Zeqi Hao, Guoqing Xu, Yun Luo, Heng Hu, Jianping An, and Shiwen Mao. 2023. Multi-Agent Collaborative Inference via DNN Decoupling: Intermediate Feature Compression and Edge Learning.IEEE Transactions on Mobile Computing 22, 10 (2023), 6041–6055
2023
-
[44]
Ying He, Jingcheng Fang, F. Richard Yu, and Victor C. Leung. 2024. Large Language Models (LLMs) Inference Offloading and Resource Allocation in Cloud-Edge Computing: An Active Inference Approach.IEEE Transactions on Mobile Computing23, 12 (2024), 11253–11266. doi:10.1109/TMC.2024.3415661
-
[45]
John L. Hennessy and David A. Patterson. 2017. Computer Architecture: A Quantitative Approach (6 ed.). Morgan Kaufmann, Cambridge, MA, USA
2017
-
[46]
Miao Hu, Qi He, and Di Wu. 2025. QLLMS: Quantization-Adaptive LLM Scheduling for Partially Informed Edge Serving Systems. In Proceedings of IEEE INFOCOM. doi:10.1109/INFOCOM55648.2025.11044591
-
[47]
Shuyan Hu, Xiaojing Chen, Wei Ni, Xin Wang, and Ekram Hossain. 2020. Modeling and Analysis of Energy Harvesting and Smart Grid-Powered Wireless Communication Networks: A Contemporary Survey.IEEE Transactions on Green Communications and Networking4, 2 (2020), 461–496. doi:10.1109/TGCN.2020.2988270
-
[48]
E J Husom, A Goknil, M Astekin, L K Shar, A Kåsen, S Sen, B A Mithassel, and A Soylu. 2025. Sustainable LLM Inference for Edge AI: Evaluating Quantized LLMs for Energy Efficiency, Output Accuracy, and Inference Latency. ACM Transactions on Internet of Things(2025)
2025
-
[49]
Kunal Jain, Anjaly Parayil, Ankur Mallick, Esha Choukse, Xiaoting Qin, Jue Zhang, Íñigo Goiri, Rujia Wang, Chetan Bansal, Victor Rühle, Anoop Kulkarni, Steve Kofsky, and Saravan Rajmohan. 2025. Performance Aware LLM Load Balancer for Mixed Workloads. InProceedings of the 5th Workshop on Machine Learning and Systems (EuroMLSys ’25). ACM, 19–30. doi:10.1145...
- [50]
-
[51]
Chaoyi Jiang, Lei Gao, Hossein Entezari Zarch, and Murali Annavaram. 2025. KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation. InFindings of the Association for Computational Linguistics: ACL 2025. Association for Computational Linguistics, 19474–19488. https://github.com/chaoyij/KVPR
2025
-
[52]
Huiqiang Jiang, Qianhui Wu, Chin-Yew Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2023. LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). 13358–13376
2023
-
[53]
Yikun Jiang, Huanyu Wang, Lei Xie, Hanbin Zhao, Chao Zhang, Hui Qian, and John C.S. Lui. 2024. D-LLM: A Token Adaptive Computing Resource Allocation Strategy for Large Language Models. InProceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS). NeurIPS
2024
-
[54]
Hongpeng Jin and Yanzhao Wu. 2025. CE-CoLLM: Efficient and Adaptive Large Language Models Through Cloud- Edge Collaboration. In2025 IEEE International Conference on Web Services (ICWS). 316–323. doi:10.1109/ICWS67624. 2025.00046
-
[55]
Andreas Kosmas Kakolyris, Dimosthenis Masouros, Petros Vavaroutsos, Sotirios Xydis, and Dimitrios Soudris. 2025. ThrottLL’eM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving. In2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 1363–1378
2025
- [56]
-
[57]
Christopher Keith, Michael Robinson, Francis Duncan, Allan Worthington, Joseph Wilson, and Sofia Harris. 2024. Optimizing Large Language Models: A Novel Approach Through Dynamic Token Pruning.Research Square(2024). doi:10.21203/rs.3.rs-5293588/v1
-
[58]
Rui Kong, Yuanchun Li, Qingtian Feng, Weijun Wang, Xiaozhou Ye, Ye Ouyang, Linghe Kong, and Yunxin Liu. 2024. SwapMoE: Serving off-the-shelf MoE-based large language models with tunable memory budget. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 6710–6720
2024
-
[59]
Sravani Kurma, Anal Paul, Keshav Singh, Kapal Dev, and Chih-Peng Li. 2025. LLMs for Resource Allocation in Next- Gen RIS-Aided Healthcare Wireless Networks. InIEEE INFOCOM 2025 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). 1–6. doi:10.1109/INFOCOMWKSHPS65812.2025.11152831
-
[60]
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP '23). ACM, 611–626. doi:10.1145/3600006.3613165
-
[61]
Lei Lei, Yaxiong Yuan, Yu Zhou, Yang Yang, Yu Luo, Lina Pu, and Symeon Chatzinotas. 2024. Energy Optimization and Lightweight Design for Efficient Federated Learning in Wireless Edge Systems.IEEE Transactions on Vehicular Technology73, 9 (2024), 13542–13557
2024
-
[62]
Hanxi Li, Guorong Chen, Bin Wang, Zheng Chen, Yongsheng Zhu, Fuqiang Hu, Jiao Dai, and Wei Wang. 2025. PFedKD: Personalized Federated Learning via Knowledge Distillation Using Unlabeled Pseudo Data for Internet of Things.IEEE Internet of Things Journal12, 11 (June 2025), 16314–16327. doi:10.1109/JIOT.2025.3533003
-
[63]
Jinrong Li, Biao Han, Sudan Li, Xiaoyan Wang, and Jie Li. 2024. CoLLM: A Collaborative LLM Inference Framework for Resource-Constrained Devices. InProceedings of the IEEE/CIC International Conference on Communications in China (ICCC). doi:10.1109/ICCC62479.2024.10681712
-
[64]
Shiyao Li, Xuefei Ning, Ke Hong, Tengxuan Liu, Luning Wang, Xiuhong Li, Kai Zhong, Guohao Dai, Huazhong Yang, and Yu Wang. 2023. LLM-MQ: Mixed-precision Quantization for Efficient LLM Deployment. In NeurIPS 2023 Workshop on Efficient Natural Language and Speech Processing
2023
-
[65]
Yuhang Li, Rong Gu, Chengying Huan, Zhibin Wang, Renjie Yao, Chen Tian, and Guihai Chen. 2025. HotPrefix: Hotness-Aware KV Cache Scheduling for Efficient Prefix Sharing in LLM Inference Systems.Proceedings of the ACM on Management of Data (SIGMOD)3, 4 (2025), Article 250. doi:10.1145/3749168
-
[66]
Zonghang Li, Wenjiao Feng, Mohsen Guizani, and Hongfang Yu. 2025. TPI-LLM: Serving 70B-Scale LLMs Efficiently on Low-Resource Mobile Devices.IEEE Transactions on Services Computing18, 5 (2025), 3321–3333. doi:10.1109/TSC. 2025.3596892
2025
-
[67]
Chengsi Liang, Hongyang Du, Yao Sun, Dusit Niyato, Jiawen Kang, Dezong Zhao, and Muhammad Ali Imran. 2025. Generative AI-Driven Semantic Communication Networks: Architecture, Technologies, and Applications.IEEE Transactions on Cognitive Communications and Networking11, 1 (2025), 27–47. doi:10.1109/TCCN.2024.3435524
-
[68]
Gui Ling, Ziyang Wang, Yuliang Yan, and Qingwen Liu. 2024. SlimGPT: Layer-wise Structured Pruning for Large Language Models. InProceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS)
2024
-
[69]
Dong Liu and Yanxuan Yu. 2025. TinyServe: Query-Aware Cache Selection for Efficient LLM Serving. InProceedings of the 33rd ACM International Conference on Multimedia (MM ’25). ACM, 12528–12536. doi:10.1145/3746027.3758181
-
[70]
Jiacheng Liu, Peng Tang, Wenfeng Wang, Yuhang Ren, Xiaofeng Hou, Pheng-Ann Heng, Minyi Guo, and Chao Li
- [71]
-
[72]
Shu Liu, Dingzhu Wen, Da Li, Qimei Chen, Guangxu Zhu, and Yuanming Shi. 2024. Energy-Efficient Optimal Mode Selection for Edge AI Inference via Integrated Sensing-Communication-Computation.IEEE Transactions on Mobile Computing23, 12 (2024), 14248–14262. doi:10.1109/TMC.2024.3440581
- [73]
-
[74]
Yuxuan Liu. 2024. Learning to Reason with Autoregressive In-Context Distillation. InProceedings of the International Conference on Learning Representations (ICLR), Tiny Papers Track
2024
-
[75]
Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. 2024. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving. InProceedings of the ACM SIGCOMM 2024 Conference. ACM, 1–18. doi:10....
-
[76]
Zhang Liu, Hongyang Du, Lianfen Huang, Zhibin Gao, and Dusit Niyato. 2025. Joint Model Caching and Resource Allocation in Generative AI-Enabled Wireless Edge Networks. In2025 IEEE Wireless Communications and Networking Conference (WCNC). 1–6. doi:10.1109/WCNC61545.2025.10978225
-
[77]
Zhihao Liu, Xianliang Yang, Zichuan Liu, Yifan Xia, Wei Jiang, Yuanyu Zhang, Lijuan Li, Guoliang Fan, Lei Song, and Bian Jiang. 2024. Knowing what not to do: Leverage language model insights for action space pruning in multi-agent reinforcement learning.arXiv preprint arXiv:2405.16854(2024)
-
[78]
Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2024. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. InInternational Conference on Machine Learning (ICML)
2024
-
[79]
Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, and Vikas Chandra. 2024. MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases. InInternational Conference on Machine Learning (ICML)
2024
-
[80]
Jianlan Luo, Perry Dong, Jeffrey Wu, Aviral Kumar, Xinyang Geng, and Sergey Levine. 2023. Action-Quantized Offline Reinforcement Learning for Robotic Skill Learning. InProceedings of the 7th Conference on Robot Learning (CoRL). Atlanta, USA. https://saqrl.github.io
2023