pith. machine review for the scientific record.

arxiv: 2601.11652 · v2 · submitted 2026-01-15 · 💻 cs.DC · cs.AI

Recognition: no theorem link

WISP: Waste- and Interference-Suppressed Distributed Speculative LLM Serving at the Edge via Dynamic Drafting and SLO-Aware Batching

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 14:08 UTC · model grok-4.3

classification 💻 cs.DC cs.AI
keywords distributed LLM serving · speculative decoding · edge computing · SLO-aware batching · waste suppression · verification interference · dynamic drafting

The pith

WISP suppresses wasted drafting time and verification interference in distributed LLM serving through dynamic edge drafting and SLO-aware server batching.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that two key bottlenecks, wasted drafting time on edge devices and interference during server-side verification, limit the scalability of using edge devices for speculative LLM inference. WISP counters these with an intelligent speculation controller, a verification time estimator, and a batch scheduler that together optimize the split of work between edge and cloud while maintaining lossless accuracy. This matters because it promises to relieve data-center overload by pushing part of the growing volume of user inference requests back onto the underutilized edge devices that originate them.

Core claim

WISP formalizes wasted drafting time and verification interference as bottlenecks in distributed speculative LLM serving and mitigates them via dynamic drafting control, verification time estimation, and SLO-aware batching, yielding capacity improvements of up to 2.1x over centralized serving and 4.1x over SLED, with corresponding goodput increases of up to 1.94x and 3.7x.
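The abstract does not pin down how capacity and goodput are measured; a common reading in SLO-aware serving work, and the one assumed in the working definitions below, is that goodput counts only output delivered within its SLO and capacity is the largest load the system can sustain at a target SLO-attainment level.

```latex
% Hedged working definitions (assumed, not quoted from the paper):
% goodput counts only tokens whose request met its latency SLO within a
% measurement window; capacity is the largest arrival rate that still
% sustains SLO attainment of at least 1 - epsilon.
\[
\mathrm{goodput} \;=\; \frac{\sum_{r \in R} \mathrm{tokens}(r)\,
  \mathbf{1}\!\left[\,T_{\mathrm{resp}}(r) \le \mathrm{SLO}(r)\,\right]}{T_{\mathrm{window}}},
\qquad
\mathrm{capacity} \;=\; \max\left\{\lambda \;:\;
  \Pr\!\left[T_{\mathrm{resp}} \le \mathrm{SLO}\right] \ge 1-\varepsilon
  \ \text{at arrival rate } \lambda\right\}.
\]
```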

What carries the argument

An intelligent speculation controller, a verification time estimator, and an SLO-aware verification batch scheduler that collaboratively reduce waste and interference in edge-cloud speculative decoding.
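As one concrete reading of the controller component, and only a sketch, the loop below gates drafting on the draft model's own confidence and entropy, the two statistics that Figures 1-3 correlate with acceptance rate. The API, thresholds, and stopping rule are hypothetical; the paper's actual controller may use a different policy.

```python
# Minimal sketch of a confidence-gated dynamic drafting loop on the edge
# device. `draft_model.step`, the threshold values, and the stopping rule
# are illustrative assumptions, not the paper's implementation.

def draft_until_unprofitable(draft_model, prefix, max_draft=8,
                             min_confidence=0.4, max_entropy=1.5):
    """Emit draft tokens only while the draft model looks confident enough
    that server-side verification is likely to accept them; low-confidence
    tokens mostly turn into Wasted Drafting Time."""
    drafts = []
    for _ in range(max_draft):
        token, confidence, entropy = draft_model.step(prefix + drafts)
        if confidence < min_confidence or entropy > max_entropy:
            break  # stop speculating; ship what we have for verification
        drafts.append(token)
    return drafts
```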

If this is right

  • Edge devices handle initial drafts to reduce central computation load for LLM inference.
  • Verification scheduling avoids interference to improve server efficiency and throughput.
  • Overall system capacity increases allow handling more concurrent inference requests.
  • Goodput rises because fewer resources are wasted on incorrect drafts and scheduling conflicts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar control mechanisms could reduce overhead in other speculative or predictive distributed systems beyond LLMs.
  • Incorporating variable network latency into the time estimator might further improve performance in real-world edge settings.
  • Energy use at edge devices could decrease if drafting remains lightweight, enabling longer operation in battery-powered scenarios.

Load-bearing premise

Edge devices can draft reliably with negligible local overhead, and speculation remains lossless when drafting and verification are split across the network.

What would settle it

A real deployment measurement showing that local edge drafting overhead exceeds server savings or that distributed speculation reduces accuracy compared to centralized methods.
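One way to frame that test, purely as an illustrative per-round accounting and not the paper's cost model, is to compare server time saved against edge time spent under assumed constants:

```python
# Back-of-envelope accounting for the falsification test above. The cost
# terms and the i.i.d.-acceptance assumption are illustrative placeholders,
# not values taken from the paper's measurements.

def drafting_pays_off(alpha, k, t_target_decode, t_verify_per_token,
                      t_draft_per_token, rtt):
    """alpha: per-token draft acceptance probability; k: draft tokens per
    round; times in seconds. Returns (server_seconds_saved,
    edge_seconds_spent) for one speculation round."""
    # Expected accepted draft tokens when acceptance stops at the first reject.
    expected_accepted = sum(alpha ** i for i in range(1, k + 1))
    # Server side: one batched verification of k tokens replaces the
    # sequential decode steps it would otherwise run for the accepted tokens.
    server_saved = expected_accepted * t_target_decode - k * t_verify_per_token
    # Edge side: local drafting plus one network round trip per round.
    edge_spent = k * t_draft_per_token + rtt
    return server_saved, edge_spent
```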

Figures

Figures reproduced from arXiv: 2601.11652 by Ali R. Butt, Bo Ji, Deepu John, Dimitrios S. Nikolopoulos, Dimitrios Spatharakis, Hans Vandierendonck, Jiakun Fan, Qingyuan Wang, Saeid Ghafouri, Xiangchen Li.

Figure 1: Impact of WDT on Device Goodput. Panels plot draft token acceptance rate against confidence score and against entropy (overall acceptance rate 0.732), with sample-count histograms.
Figure 2: Relationships of Draft Logits Statistics with Acceptance Rate.
Figure 3: Correlation between Logits Statistics and Token Acceptance.
Figure 4: Verification Interference among Heterogeneous Requests.
Figure 5: WISP architecture and data flow. The accompanying text motivates WISP by the lack of SLO-awareness, wasted drafting time, predictable acceptance rates, and verification interference observed in prior distributed speculative edge LLM serving systems: user devices speculate draft tokens while the GPU cluster batches and verifies them, letting the system serve more edge devices.
Figure 6: Architecture of the verification engine. The accompanying computational modeling treats the linear operations (embeddings, QKV projections, FFNs) as stateless per-token work, so their aggregate cost scales with the number of tokens entering the GPU whether a request is a first-time prompt or a follow-up draft.
Figure 7: Token Generation Speed and SLO Violations.
Figure 8: Violation attribution in the queue–compute plane.
Figure 9: Additive verification-latency predictor: end-to-end fit and residual diagnostics.
Figure 10: Row-normalized confusion matrices of four predictors.
Figure 11: Error breakdown of the additive latency predictor across workload regimes.
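Figures 6 and 9 point to an additive verification-latency model, with linear-layer cost scaling in total tokens entering the GPU and attention cost scaling in context length, feeding an SLO-aware batch decision. A minimal sketch of that shape, with assumed request fields and fitted coefficients, might look as follows; it is not the paper's estimator.

```python
# Illustrative additive verification-latency predictor and an SLO-aware
# batch admission test in its spirit. Coefficients a_lin, a_attn, b stand
# in for values fitted offline; the request dictionary fields are assumed.

def predict_verify_latency(batch_tokens, kv_lengths, a_lin, a_attn, b):
    """Linear layers scale with the number of tokens entering the GPU,
    whether from first-time prompts or follow-up drafts; attention scales
    with the KV-cache context each request attends over."""
    return a_lin * batch_tokens + a_attn * sum(kv_lengths) + b

def can_admit(batch, candidate, now, a_lin, a_attn, b):
    """Admit `candidate` only if the predicted latency of the enlarged
    batch still meets every queued request's SLO deadline."""
    requests = batch + [candidate]
    latency = predict_verify_latency(
        sum(r["num_tokens"] for r in requests),
        [r["kv_len"] for r in requests],
        a_lin, a_attn, b,
    )
    return all(now + latency <= r["deadline"] for r in requests)
```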
Original abstract

As Large Language Models (LLMs) become increasingly accessible to end users, an ever-growing number of inference requests are initiated from edge devices and computed on centralized GPU clusters. However, the resulting exponential growth in computation workload is placing significant strain on data centers, while edge devices remain largely underutilized, leading to imbalanced workloads and resource inefficiency across the network. Integrating edge devices into the LLM inference process via speculative decoding helps balance the workload between the edge and the cloud, while maintaining lossless prediction accuracy. In this paper, we identify and formalize two critical bottlenecks that limit the efficiency and scalability of distributed speculative LLM serving: Wasted Drafting Time and Verification Interference. To address these challenges, we propose WISP, an efficient and SLO-aware distributed LLM inference system that consists of an intelligent speculation controller, a verification time estimator, and a verification batch scheduler. These components collaboratively enhance drafting efficiency and optimize verification request scheduling on the server. Extensive numerical results show that WISP improves system capacity by up to 2.1x and 4.1x, and increases system goodput by up to 1.94x and 3.7x, compared to centralized serving and SLED, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces WISP, a distributed speculative LLM serving system that integrates edge devices for drafting to balance workloads between edge and centralized GPU clusters. It formalizes two bottlenecks—Wasted Drafting Time and Verification Interference—and addresses them via an intelligent speculation controller, a verification time estimator, and an SLO-aware batch scheduler. The central claim is that WISP improves system capacity by up to 2.1x and 4.1x and system goodput by up to 1.94x and 3.7x relative to centralized serving and SLED, respectively, while preserving lossless accuracy.

Significance. If the reported gains are reproducible under realistic network conditions, the work would offer a practical approach to reducing data-center load for LLM inference by leveraging underutilized edge hardware. The emphasis on dynamic drafting and SLO-aware scheduling targets a timely problem in edge-cloud AI systems and could influence future distributed inference designs.

major comments (2)
  1. [Abstract] The headline capacity (2.1x/4.1x) and goodput (1.94x/3.7x) gains are presented as experimental outcomes, yet the abstract supplies no information on workload traces, hardware configurations, statistical significance testing, or sensitivity analysis. This absence is load-bearing because the multipliers depend on the unverified assumptions that edge drafting overhead is negligible and that network-split speculation remains lossless.
  2. [Evaluation] The performance model relies on the claim that local draft-model execution adds near-zero latency relative to network round-trips and that verification produces exactly the same token distribution despite asynchrony. No sensitivity analysis or fallback behavior under network jitter is reported, which directly undermines the reported multipliers if either assumption fails.
minor comments (1)
  1. [Introduction] The definitions of Wasted Drafting Time and Verification Interference should be stated formally with equations in the introduction or system model section rather than only in prose.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for greater transparency in the abstract and evaluation. We address each point below and have revised the manuscript to incorporate additional details on experimental setup, assumptions, and sensitivity analysis where feasible.

Point-by-point responses
  1. Referee: [Abstract] The headline capacity (2.1x/4.1x) and goodput (1.94x/3.7x) gains are presented as experimental outcomes, yet the abstract supplies no information on workload traces, hardware configurations, statistical significance testing, or sensitivity analysis. This absence is load-bearing because the multipliers depend on the unverified assumptions that edge drafting overhead is negligible and that network-split speculation remains lossless.

    Authors: We agree that the abstract is highly condensed and omits key experimental context. The full paper details the workload traces (derived from production LLM serving logs), hardware configurations (specific edge device models paired with cloud A100 GPUs), and reports mean results over multiple runs with standard deviation. The lossless property follows directly from the speculative decoding acceptance rule, which we validate empirically. We have revised the abstract to include a brief clause on the evaluation setup and validated assumptions. revision: partial

  2. Referee: [Evaluation] The performance model relies on the claim that local draft-model execution adds near-zero latency relative to network round-trips and that verification produces exactly the same token distribution despite asynchrony. No sensitivity analysis or fallback behavior under network jitter is reported, which directly undermines the reported multipliers if either assumption fails.

    Authors: The original evaluation includes measurements showing draft execution latency is <5% of typical round-trip time under the tested network conditions, and token distributions match because acceptance is governed by the same verification model. However, we acknowledge the absence of explicit jitter sensitivity sweeps. In the revision we have added a new subsection with results under synthetic jitter (0-50 ms) and a fallback to reduced speculation depth, confirming that the reported gains degrade gracefully but remain positive up to moderate jitter levels. revision: yes
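As a reader's aid, and not a reproduction of the rebuttal's experiment, a jitter sweep of the kind the authors describe can be prototyped in a few lines; the constants, the uniform jitter model, and the depth-fallback threshold below are all placeholders.

```python
# Toy jitter sweep in the spirit of the rebuttal's added subsection. All
# numbers and distributions are placeholders, not the paper's results.
import random

def simulate_tokens_per_second(alpha, k, t_draft, t_verify, rtt_base,
                               jitter_s, rounds=10_000, seed=0):
    """Per round: draft up to k tokens on the edge, pay one jittered round
    trip, then one batched server verification; acceptance stops at the
    first rejected draft token."""
    rng = random.Random(seed)
    tokens, elapsed = 0, 0.0
    for _ in range(rounds):
        accepted = 0
        while accepted < k and rng.random() < alpha:
            accepted += 1
        tokens += accepted + 1          # accepted drafts plus one verified token
        elapsed += k * t_draft + rtt_base + rng.uniform(0, jitter_s) + t_verify
    return tokens / elapsed

def pick_speculation_depth(jitter_s, deep=8, shallow=2, threshold_s=0.03):
    """Fallback mirroring the rebuttal's description: shrink the
    speculation depth when measured jitter is high."""
    return shallow if jitter_s > threshold_s else deep
```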

Circularity Check

0 steps flagged

No circularity: performance claims rest on numerical experiments, not self-referential derivations

Full rationale

The paper describes a distributed LLM serving system (WISP) with three components—an intelligent speculation controller, verification time estimator, and SLO-aware batch scheduler—intended to mitigate wasted drafting time and verification interference. All reported gains (capacity up to 2.1×/4.1×, goodput up to 1.94×/3.7× versus baselines) are explicitly attributed to 'extensive numerical results' rather than any closed-form derivation, fitted parameter renamed as prediction, or self-citation chain. No equations are presented that define a quantity in terms of itself or that import uniqueness from prior author work. The central claims therefore remain externally falsifiable via independent simulation or measurement and do not reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions from speculative decoding literature and distributed systems; no free parameters, new axioms, or invented entities are named in the abstract.

axioms (1)
  • domain assumption Speculative decoding preserves lossless accuracy when drafting occurs on edge devices and verification on the server
    Stated in the abstract as the basis for integrating edge devices without accuracy loss.
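The lossless claim ultimately leans on the standard speculative-sampling acceptance rule of reference [20], which preserves the target distribution regardless of where drafting runs. A minimal sketch of that rule, not the paper's code, is:

```python
# Standard speculative-sampling acceptance rule (reference [20]); applying
# it at server-side verification keeps the emitted token distributed
# exactly as the target model would sample, wherever drafting happens.
import numpy as np

def verify_one_token(draft_token, q, p, rng=np.random.default_rng()):
    """q: draft-model distribution over the vocabulary, p: target-model
    distribution (q[draft_token] > 0 since the token was sampled from q).
    Returns the emitted token, distributed exactly according to p."""
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token                     # accept the draft token
    residual = np.maximum(p - q, 0.0)          # on rejection, resample from
    residual /= residual.sum()                 # the normalized residual
    return int(rng.choice(len(p), p=residual))
```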

pith-pipeline@v0.9.0 · 5568 in / 1267 out tokens · 49104 ms · 2026-05-16T14:08:25.462833+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ConfigSpec: Profiling-Based Configuration Selection for Distributed Edge--Cloud Speculative LLM Serving

    cs.DC · 2026-04 · unverdicted · novelty 5.0

    ConfigSpec shows that optimal configurations for speculative LLM inference conflict across goodput (favoring smallest drafters at device-specific K=2-10), cost (favoring largest drafters at K=2), and energy (favoring ...

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Keivan Alizadeh, Seyed Iman Mirzadeh, Dmitry Belenko, S Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar. 2024. Llm in a flash: Efficient large language model inference with limited memory. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 12562–12584

  2. [2]

    Alphabet Inc. 2025. Alphabet Q2 2025 Earnings Call Transcript. https://abc.xyz/investor/earnings/2025/Q2_alphabet_earnings_call.pdf. Reported that AI Overviews in Search serve 2 billion monthly users globally

  3. [3]

    Alphabet Inc. 2025. Alphabet Q3 2025 Earnings Call Transcript. https://abc.xyz/investor/earnings/2025/Q3_alphabet_earnings_call.pdf. CEO Sundar Pichai announced the Gemini app reached 650 million monthly active users

  4. [4]

    Apple Inc. 2024. Apple Events: WWDC24 Keynote. Apple Newsroom. https://www.apple.com/newsroom/2024/06/apple-intelligence-personal-intelligence-system/. Revealed that Siri processes 1.5 billion requests daily

  5. [5]

    Agrim Bari, Parikshit Hegde, and Gustavo de Veciana. 2025. Optimal Scheduling Algorithms for LLM Inference: Theory and Practice. Proceedings of the ACM on Measurement and Analysis of Computing Systems 9, 3 (2025), Article 59. doi:10.1145/3771574

  6. [6]

    Vitaliy Bibaev, Alexey Kalina, Vadim Lomshakov, Yaroslav Golubev, Alexander Bezzubov, Nikita Povarov, and Timofey Bryksin. 2022. All you need is logs: improving code completion by learning from anonymous IDE usage logs. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1269–1279

  7. [7]

    Marc Brysbaert. 2019. How many words do we read per minute? A review and meta-analysis of reading rate. Journal of Memory and Language 109 (2019), 104047. doi:10.1016/j.jml.2019.104047

  8. [8]

    Yuxuan Chen, Rongpeng Li, Xiaoxue Yu, Zhifeng Zhao, and Honggang Zhang. 2024. Adaptive layer splitting for wireless llm inference in edge computing: A model-based reinforcement learning approach. arXiv preprint arXiv:2406.02616 (2024)

  9. [9]

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, Sanmi Koyejo, S...

  10. [10]

    Jiakun Fan, Yanglin Zhang, Xiangchen Li, and Dimitrios S Nikolopoulos. 2025. Parallel CPU-GPU Execution for LLM Inference on Constrained GPUs. arXiv preprint arXiv:2506.03296 (2025)

  11. [11]

    Ina Fried. 2025. OpenAI’s ChatGPT Now Handles 2.5 Billion Prompts Daily. Axios (22 July 2025). https://www.axios.com/2025/07/22/chatgpt-daily-prompts-openai-sam-altman. Reported exclusively by Axios based on data from OpenAI

  12. [12]

    Georgi Gerganov. 2023. llama.cpp: Port of Facebook’s LLaMA model in C/C++. https://github.com/ggerganov/llama.cpp

  13. [13]

    In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. 2024. Prompt Cache: Modular Attention Reuse for Low-Latency Inference. In Proceedings of the 5th Conference on Machine Learning and Systems (MLSys 2024). Santa Clara, CA, USA. https://proceedings.mlsys.org/paper_files/paper/2024/file/a66caa1703fe34705a4368c3014c1966-Pap...

  14. [14]

    gRPC Authors. [n. d.]. grpc/grpc: An RPC library and framework (source code repository). https://github.com/grpc/grpc. Accessed January 9, 2026

  15. [15]

    Han Huang, Jeonghyeon Cho, Jisung Hwang, Yuhan Qiu, Hyesu Jeong, Ji Won Choi, and Youngsok Kim. 2025. Scheduling Under Multiple Service Level Objectives for LLM Inference. arXiv preprint arXiv:2504.14966 (2025)

  16. [16]

    Kaiyu Huang, Hao Wu, Zhubo Shi, Han Zou, Minchen Yu, and Qingjiang Shi. 2025. SpecServe: Efficient and SLO-Aware Large Language Model Serving with Adaptive Speculative Decoding. arXiv preprint arXiv:2503.05096 (2025)

  17. [17]

    Yibo Jin, Yixu Xu, Yue Chen, Chengbin Wang, Tao Wang, Jiaqi Huang, Rongfei Zhang, Yiming Dong, Yuting Yan, Ke Cheng, et al. 2025. P/D-Device: Disaggregated Large Language Model between Cloud and Devices. arXiv preprint arXiv:2508.09035 (2025)

  18. [18]

    Samuel Kernan Freire, Mina Foosherian, Chaofan Wang, and Evangelos Niforatos. 2023. Harnessing Large Language Models for Cognitive Assistants in Factories. In Proceedings of the ACM Conference on Conversational User Interfaces (CUI ’23). doi:10.1145/3571884.3604313

  19. [19]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv preprint arXiv:2309.06180 (2023). https://arxiv.org/abs/2309.06180

  20. [20]

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning. PMLR, 19274–19286

  21. [21]

    Xiangchen Li, Dimitrios Spatharakis, Saeid Ghafouri, Jiakun Fan, Hans Vandierendonck, Deepu John, Bo Ji, and Dimitrios S Nikolopoulos. 2025. Sled: A speculative llm decoding framework for efficient edge serving. In Proceedings …

  22. [22]

    C. L. Liu and James W. Layland. 1973. Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment. J. ACM 20, 1 (1973), 46–61

  23. [23]

    Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, et al. 2024. Mobilellm: Optimizing sub-billion parameter language models for on-device use cases. In Forty-first International Conference on Machine Learning

  24. [24]

    MLC team. 2023. MLC-LLM. https://github.com/mlc-ai/mlc-llm

  25. [25]

    Akrit Mudvari, Yuang Jiang, and Leandros Tassiulas. 2024. Splitllm: Collaborative inference of llms for model placement and throughput optimization. arXiv preprint arXiv:2410.10759 (2024)

  26. [26]

    Jiahong Ning, Ce Zheng, and Tingting Yang. 2025. DSSD: Efficient Edge-Device LLM Deployment and Collaborative Inference via Distributed Split Speculative Decoding. arXiv preprint arXiv:2507.12000 (2025)

  27. [27–28]

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient generative llm inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 118–132

  29. [29]

    Wonik Seo, Sanghoon Cha, Yeonjae Kim, Jaehyuk Huh, and Jongse Park. 2021. SLO-Aware Inference Scheduler for Heterogeneous Processors in Edge Platforms. ACM Transactions on Architecture and Code Optimization 18, 4 (2021), Article 43. doi:10.1145/3460352

  30. [30]

    Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. 2024. Powerinfer: Fast large language model serving with a consumer-grade gpu. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles. 590–606

  31. [31]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)

  32. [32]

    Yeshwanth Venkatesha, Souvik Kundu, and Priyadarshini Panda. 2025. Fast and Cost-effective Speculative Edge-Cloud Decoding with Early Exits. arXiv preprint arXiv:2505.21594 (2025)

  33. [33]

    Zhenliang Xue, Yixin Song, Zeyu Mi, Xinrui Zheng, Yubin Xia, and Haibo Chen. 2024. Powerinfer-2: Fast large language model inference on a smartphone. arXiv preprint arXiv:2406.06282 (2024)

  34. [34]

    Shengyuan Ye, Jiangsu Du, Liekang Zeng, Wenzhong Ou, Xiaowen Chu, Yutong Lu, and Xu Chen. 2024. Galaxy: A resource-efficient collaborative edge ai system for in-situ transformer inference. In IEEE INFOCOM 2024 - IEEE Conference on Computer Communications. IEEE, 1001–1010

  35. [35]

    Fengze Yu, Leshu Li, Brad McDanel, and Saiqian Zhang. 2025. DSD: A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving. arXiv preprint arXiv:2511.21669 (2025)

  36. [36]

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, Carlsbad, CA, 521–538. https://www.usenix.org/conference/osdi22/presentation/yu

  37. [37]

    Zhongzhi Yu, Zheng Wang, Yuhan Li, Ruijie Gao, Xiaoya Zhou, Sreenidhi Reddy Bommu, Yang Zhao, and Yingyan Lin. 2024. Edge-llm: Enabling efficient large language model adaptation on edge devices via unified compression and adaptive layer voting. In Proceedings of the 61st ACM/IEEE Design Automation Conference. 1–6

  38. [38]

    Mingjin Zhang, Xiaoming Shen, Jiannong Cao, Zeyang Cui, and Shan Jiang. 2024. Edgeshard: Efficient llm inference via collaborative edge computing. IEEE Internet of Things Journal (2024)

  39. [39]

    Shunying Zhang, Weijian Feng, Qing Li, and Ming Zhou. 2023. BCEdge: A Deadline-Aware Real-Time Inference Framework for Heterogeneous Edge Devices. arXiv preprint arXiv:2312.06341 (2023)