pith. machine review for the scientific record.

arxiv: 2604.22906 · v1 · submitted 2026-04-24 · 💻 cs.DC

Recognition: unknown

Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 09:43 UTC · model grok-4.3

classification 💻 cs.DC
keywords large language models · edge inference · network edge · model optimization · resource management · system architectures · distributed computing

The pith

Large language models can perform inference at the network edge through specialized system architectures, model optimizations, and resource management techniques.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey identifies the unique challenges of running large language models on edge networks where memory and compute resources are limited. It reviews recent progress across system architectures for deployment, techniques to optimize and compress models, and approaches to resource management and scheduling. A sympathetic reader would care because successful edge inference could lower latency, improve privacy, and support AI applications in bandwidth-constrained or mobile settings. The paper synthesizes these techniques and maps future directions to make LLM capabilities practical in such environments.

Core claim

The central claim is that the large memory and compute demands of LLM inference at the network edge can be met by combining advances in system architectures, model optimization and deployment, and resource management and scheduling, thereby unlocking the potential of LLMs in resource-constrained edge environments.

What carries the argument

A structured categorization of techniques into system architectures, model optimization and deployment, and resource management and scheduling, which together address the demands of LLMs at the edge.

If this is right

  • System architectures can distribute LLM computations across edge nodes to fit within hardware limits.
  • Model optimization and deployment methods reduce memory footprint and compute needs for edge devices.
  • Resource management and scheduling improve efficiency under varying loads and multiple users.
  • Future research directions identified can guide development of edge-specific LLM variants and frameworks.
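The second bullet can be made concrete with a back-of-envelope sketch. The parameter count, bit widths, and the 8 GB device budget below are illustrative assumptions, not figures taken from the survey:

```python
# Back-of-envelope weight-memory estimate for edge deployment.
# The 7B parameter count and 8 GB budget are hypothetical examples.

def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory needed to hold the model weights alone, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 7e9          # a 7B-parameter model, roughly Mistral-7B-class
DEVICE_BUDGET_GB = 8.0  # assumed edge-device memory budget

for bits in (16, 8, 4):
    gb = weight_memory_gb(N_PARAMS, bits)
    verdict = "fits" if gb <= DEVICE_BUDGET_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {gb:5.1f} GB -> {verdict} in {DEVICE_BUDGET_GB} GB")
```

On these assumptions, 16-bit weights alone (14 GB) overflow the device, while 8-bit and 4-bit quantization bring the footprint under budget — which is why the optimization-and-deployment category carries so much of the edge-inference argument. Activations and KV cache add further memory on top of this estimate.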

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Edge LLM inference could reduce reliance on cloud servers and associated data transmission costs.
  • It may enable more responsive and private AI services on mobile and IoT devices without constant connectivity.
  • Integration with existing edge computing platforms could accelerate adoption in real deployments.
  • Hardware-specific benchmarks on devices like smartphones or routers would test the scalability of the surveyed approaches.
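The last bullet reads as a test plan. A minimal, device-agnostic harness might look like the sketch below; the `fake_generate` stand-in and the 10 tok/s interactivity target are hypothetical placeholders, not anything specified in the paper:

```python
import time

def tokens_per_second(generate_fn, prompt: str, n_tokens: int) -> float:
    """Time a generation callable and return decode throughput.

    generate_fn is a placeholder for whatever edge runtime is under
    test (e.g. a llama.cpp or Core ML binding); we only assume it
    takes (prompt, n_tokens) and blocks until n_tokens are produced.
    """
    start = time.perf_counter()
    generate_fn(prompt, n_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

def fake_generate(prompt: str, n_tokens: int) -> None:
    """Stand-in 'model' that sleeps 20 ms per token (~50 tok/s)."""
    for _ in range(n_tokens):
        time.sleep(0.02)

rate = tokens_per_second(fake_generate, "hello", 25)
TARGET_TOK_S = 10.0  # assumed interactivity target
print(f"{rate:.1f} tok/s -> {'meets' if rate >= TARGET_TOK_S else 'misses'} target")
```

Running the same harness across phones, routers, and single-board computers, with the real runtime substituted for `fake_generate`, is the kind of hardware-specific benchmark the bullet calls for.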

Load-bearing premise

The reviewed techniques from the literature can be practically combined and scaled to real-world edge environments while maintaining acceptable accuracy and efficiency.

What would settle it

An experiment that combines the surveyed architectures, optimizations, and scheduling methods on standard edge hardware and shows either unacceptable accuracy loss or failure to meet efficiency targets would falsify the practicality claim.

Figures

Figures reproduced from arXiv: 2604.22906 by Arumugam Nallanathan, Bingjie Zhu, Dusit Niyato, Hyundong Shin, Jiangzhou Wang, Zhixiong Chen.

Figure 1: Architectures and inference process of LLMs. view at source ↗
Figure 2: Architectures of LLM edge inference: (a) single-edg… view at source ↗
Figure 3: Decoding strategies: (a) Non-autoregressive. (b) E… view at source ↗
Figure 4: Comparison of different parallelism methods. view at source ↗
read the original abstract

Large language models (LLMs) have advanced rapidly, emerging as versatile tools across fields thanks to their exceptional language understanding, generation, and reasoning capabilities. However, performing LLM inference at the network edge remains challenging due to their large memory and compute demands. This survey outlines the challenges specific to LLM edge inference and provides a comprehensive overview of recent progress, covering system architectures, model optimization and deployment, and resource management and scheduling. By synthesizing state-of-the-art techniques and mapping future directions, this survey aims to unlock the potential of LLMs in resource-constrained edge environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper is a survey on network edge inference for large language models. It outlines the challenges arising from the high memory and compute demands of LLMs when deployed at the edge, and synthesizes recent progress across three areas: system architectures, model optimization and deployment techniques, and resource management and scheduling. The survey concludes by mapping future research directions to enable practical LLM use in resource-constrained edge environments.

Significance. If the synthesis is accurate and reasonably complete, the survey would provide a useful consolidation of techniques for the distributed systems and edge computing communities, helping researchers identify relevant architectures and optimization strategies without needing to survey the rapidly growing literature independently. No novel derivations, proofs, or empirical results are presented, so significance rests entirely on the quality of the literature mapping.

major comments (1)
  1. [Abstract] The claim of providing a 'comprehensive overview' of recent progress is not supported by any description of the literature search methodology, inclusion criteria, time window, or databases used. Without this, readers cannot evaluate completeness or selection bias, which directly affects the reliability of the central synthesis claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our survey. We address the single major comment below and will revise the manuscript accordingly to improve transparency.

read point-by-point responses
  1. Referee: [Abstract] The claim of providing a 'comprehensive overview' of recent progress is not supported by any description of the literature search methodology, inclusion criteria, time window, or databases used. Without this, readers cannot evaluate completeness or selection bias, which directly affects the reliability of the central synthesis claim.

    Authors: We agree that adding an explicit description of the literature selection process would enhance the survey's rigor and allow readers to assess potential biases. In the revised manuscript, we will insert a new subsection (likely in the Introduction or as Section 2) outlining the survey methodology. This will include: databases searched (arXiv, Google Scholar, IEEE Xplore, ACM Digital Library); primary keywords and combinations (e.g., 'LLM edge inference', 'model compression for edge devices', 'distributed LLM serving'); time window (primarily post-2022 to capture the LLM scaling era, with key foundational works from earlier); and inclusion criteria (focus on system architectures, optimizations, and resource management for edge LLM inference; exclusion of purely algorithmic NLP papers without deployment considerations). We will also note that the synthesis draws from approximately 150 relevant works identified through this process. This addition addresses the concern without changing the paper's technical content or structure.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity: survey synthesizes external literature without derivations or self-referential reductions

full rationale

This paper is a survey that outlines challenges in LLM edge inference and reviews existing techniques from the literature on architectures, optimization, deployment, and scheduling. It presents no novel equations, predictions, fitted parameters, or derivations that could reduce to inputs by construction. Central claims are descriptive overviews of external work rather than prescriptive results derived internally. No self-citation chains are load-bearing for any technical assertion, and the synthesis does not rename known results or smuggle ansatzes via citations. The paper is self-contained as a literature review against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper with no original mathematical derivations, data fits, or postulated entities. It relies entirely on summarizing existing published work.

pith-pipeline@v0.9.0 · 5407 in / 900 out tokens · 23288 ms · 2026-05-08T09:43:21.143723+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

214 extracted references · 62 canonical work pages · 24 internal anchors
