pith. machine review for the scientific record.

arxiv: 2604.22906 · v1 · submitted 2026-04-24 · 💻 cs.DC

Recognition: unknown

Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 09:43 UTC · model grok-4.3

classification 💻 cs.DC
keywords large language models · edge inference · network edge · model optimization · resource management · system architectures · distributed computing

The pith

Large language models can perform inference at the network edge through specialized system architectures, model optimizations, and resource management techniques.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey identifies the unique challenges of running large language models on edge networks where memory and compute resources are limited. It reviews recent progress across system architectures for deployment, techniques to optimize and compress models, and approaches to resource management and scheduling. A sympathetic reader would care because successful edge inference could lower latency, improve privacy, and support AI applications in bandwidth-constrained or mobile settings. The paper synthesizes these techniques and maps future directions to make LLM capabilities practical in such environments.

Core claim

The central claim is that the large memory and compute demands of LLM inference at the network edge can be met by combining advances in system architectures, model optimization and deployment, and resource management and scheduling, thereby unlocking the potential of LLMs in resource-constrained edge environments.

What carries the argument

A structured categorization of techniques into system architectures, model optimization and deployment, and resource management and scheduling, which together address the demands of LLMs at the edge.

If this is right

  • System architectures can distribute LLM computations across edge nodes to fit within hardware limits.
  • Model optimization and deployment methods reduce memory footprint and compute needs for edge devices.
  • Resource management and scheduling improve efficiency under varying loads and multiple users.
  • Future research directions identified can guide development of edge-specific LLM variants and frameworks.
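The second bullet can be made concrete with a back-of-envelope sketch. The parameter count, bit widths, and the 8 GB device budget below are illustrative assumptions, not figures taken from the survey:

```python
# Back-of-envelope weight-memory estimate for edge deployment.
# The 7B parameter count and 8 GB budget are hypothetical examples.

def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory needed to hold the model weights alone, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 7e9          # a 7B-parameter model, roughly Mistral-7B-class
DEVICE_BUDGET_GB = 8.0  # assumed edge-device memory budget

for bits in (16, 8, 4):
    gb = weight_memory_gb(N_PARAMS, bits)
    verdict = "fits" if gb <= DEVICE_BUDGET_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {gb:5.1f} GB -> {verdict} in {DEVICE_BUDGET_GB} GB")
```

On these assumptions, 16-bit weights alone (14 GB) overflow the device, while 8-bit and 4-bit quantization bring the footprint under budget — which is why the optimization-and-deployment category carries so much of the edge-inference argument. Activations and KV cache add further memory on top of this estimate.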

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Edge LLM inference could reduce reliance on cloud servers and associated data transmission costs.
  • It may enable more responsive and private AI services on mobile and IoT devices without constant connectivity.
  • Integration with existing edge computing platforms could accelerate adoption in real deployments.
  • Hardware-specific benchmarks on devices like smartphones or routers would test the scalability of the surveyed approaches.
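The last bullet reads as a test plan. A minimal, device-agnostic harness might look like the sketch below; the `fake_generate` stand-in and the 10 tok/s interactivity target are hypothetical placeholders, not anything specified in the paper:

```python
import time

def tokens_per_second(generate_fn, prompt: str, n_tokens: int) -> float:
    """Time a generation callable and return decode throughput.

    generate_fn is a placeholder for whatever edge runtime is under
    test (e.g. a llama.cpp or Core ML binding); we only assume it
    takes (prompt, n_tokens) and blocks until n_tokens are produced.
    """
    start = time.perf_counter()
    generate_fn(prompt, n_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

def fake_generate(prompt: str, n_tokens: int) -> None:
    """Stand-in 'model' that sleeps 20 ms per token (~50 tok/s)."""
    for _ in range(n_tokens):
        time.sleep(0.02)

rate = tokens_per_second(fake_generate, "hello", 25)
TARGET_TOK_S = 10.0  # assumed interactivity target
print(f"{rate:.1f} tok/s -> {'meets' if rate >= TARGET_TOK_S else 'misses'} target")
```

Running the same harness across phones, routers, and single-board computers, with the real runtime substituted for `fake_generate`, is the kind of hardware-specific benchmark the bullet calls for.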

Load-bearing premise

The reviewed techniques from the literature can be practically combined and scaled to real-world edge environments while maintaining acceptable accuracy and efficiency.

What would settle it

An experiment that combines the surveyed architectures, optimizations, and scheduling methods on standard edge hardware and shows either unacceptable accuracy loss or failure to meet efficiency targets would falsify the practicality claim.

Figures

Figures reproduced from arXiv: 2604.22906 by Arumugam Nallanathan, Bingjie Zhu, Dusit Niyato, Hyundong Shin, Jiangzhou Wang, Zhixiong Chen.

Figure 1: Architectures and inference process of LLMs. view at source ↗
Figure 2: Architectures of LLM edge inference: (a) single-edg… view at source ↗
Figure 3: Decoding strategies: (a) Non-autoregressive. (b) E… view at source ↗
Figure 4: Comparison of different parallelism methods. view at source ↗
read the original abstract

Large language models (LLMs) have advanced rapidly, emerging as versatile tools across fields thanks to their exceptional language understanding, generation, and reasoning capabilities. However, performing LLM inference at the network edge remains challenging due to their large memory and compute demands. This survey outlines the challenges specific to LLM edge inference and provides a comprehensive overview of recent progress, covering system architectures, model optimization and deployment, and resource management and scheduling. By synthesizing state-of-the-art techniques and mapping future directions, this survey aims to unlock the potential of LLMs in resource-constrained edge environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper is a survey on network edge inference for large language models. It outlines the challenges arising from the high memory and compute demands of LLMs when deployed at the edge, and synthesizes recent progress across three areas: system architectures, model optimization and deployment techniques, and resource management and scheduling. The survey concludes by mapping future research directions to enable practical LLM use in resource-constrained edge environments.

Significance. If the synthesis is accurate and reasonably complete, the survey would provide a useful consolidation of techniques for the distributed systems and edge computing communities, helping researchers identify relevant architectures and optimization strategies without needing to survey the rapidly growing literature independently. No novel derivations, proofs, or empirical results are presented, so significance rests entirely on the quality of the literature mapping.

major comments (1)
  1. [Abstract] The claim of providing a 'comprehensive overview' of recent progress is not supported by any description of the literature search methodology, inclusion criteria, time window, or databases used. Without this, readers cannot evaluate completeness or selection bias, which directly affects the reliability of the central synthesis claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our survey. We address the single major comment below and will revise the manuscript accordingly to improve transparency.

read point-by-point responses
  1. Referee: [Abstract] The claim of providing a 'comprehensive overview' of recent progress is not supported by any description of the literature search methodology, inclusion criteria, time window, or databases used. Without this, readers cannot evaluate completeness or selection bias, which directly affects the reliability of the central synthesis claim.

    Authors: We agree that adding an explicit description of the literature selection process would enhance the survey's rigor and allow readers to assess potential biases. In the revised manuscript, we will insert a new subsection (likely in the Introduction or as Section 2) outlining the survey methodology. This will include: databases searched (arXiv, Google Scholar, IEEE Xplore, ACM Digital Library); primary keywords and combinations (e.g., 'LLM edge inference', 'model compression for edge devices', 'distributed LLM serving'); time window (primarily post-2022 to capture the LLM scaling era, with key foundational works from earlier); and inclusion criteria (focus on system architectures, optimizations, and resource management for edge LLM inference; exclusion of purely algorithmic NLP papers without deployment considerations). We will also note that the synthesis draws from approximately 150 relevant works identified through this process. This addition addresses the concern without changing the paper's technical content or structure.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity: survey synthesizes external literature without derivations or self-referential reductions

full rationale

This paper is a survey that outlines challenges in LLM edge inference and reviews existing techniques from the literature on architectures, optimization, deployment, and scheduling. It presents no novel equations, predictions, fitted parameters, or derivations that could reduce to inputs by construction. Central claims are descriptive overviews of external work rather than prescriptive results derived internally. No self-citation chains are load-bearing for any technical assertion, and the synthesis does not rename known results or smuggle ansatzes via citations. The paper is self-contained as a literature review against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper with no original mathematical derivations, data fits, or postulated entities. It relies entirely on summarizing existing published work.

pith-pipeline@v0.9.0 · 5407 in / 900 out tokens · 23288 ms · 2026-05-08T09:43:21.143723+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

214 extracted references · 62 canonical work pages · 24 internal anchors
