EdgeServing: Deadline-Aware Multi-DNN Serving at the Edge
Pith reviewed 2026-05-08 05:42 UTC · model grok-4.3
The pith
EdgeServing schedules multiple DNNs on a shared edge GPU by jointly picking the model, exit point, and batch size, guided by a stability score, to cut deadline violations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EdgeServing shows that early-exit inference combined with a stability score lets the scheduler choose, at runtime, the model, exit point, and batch size that together minimize the forecasted SLO violations across all concurrent queues. On multiple hardware platforms the resulting system records lower SLO violation ratios and better P95 latencies than representative baselines, with the gains attributed to the expanded action space early exits provide under tight constraints.
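Read concretely, the claimed runtime decision is an argmin over a joint action space. The sketch below is ours, not the paper's: the names Candidate and stability_score are illustrative, the score itself is left abstract, and the example action space is assumed.

```python
# Illustrative sketch of the selection described above; all names are ours.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Candidate:
    model: str        # which DNN variant to dispatch
    exit_point: int   # early-exit index (smaller = earlier, cheaper)
    batch_size: int   # how many queued requests to serve together

def select_action(queues, candidates, stability_score):
    # stability_score(c, queues) is assumed to return a scalar where lower
    # means fewer forecasted deadline misses across all concurrent queues.
    return min(candidates, key=lambda c: stability_score(c, queues))

# Hypothetical action space: 2 models x 3 exit points x 3 batch sizes.
candidates = [Candidate(m, e, b) for m, e, b in
              product(["resnet50", "mobilenetv2"], [0, 1, 2], [1, 4, 8])]
```

The point of the expanded action space is that early exits add cheap, lower-latency candidates that a model-and-batch-only scheduler would not have under tight deadlines.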
What carries the argument
A stability score that quantifies the future impact of each scheduling decision on queue status, used together with early-exit points to expand the space of feasible inference actions.
Load-bearing premise
The stability score accurately predicts how each choice will change future queue lengths and deadline misses, and early-exit points keep model accuracy high enough for the target applications.
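Read operationally, the premise requires a forecast of queue evolution under each candidate action. A toy version of such a forecast follows, under the simplifying assumption (ours, not the paper's) that profiled service times are deterministic:

```python
# Toy forecast of deadline misses for one queue after taking an action.
# The paper's actual stability score is not specified here; this only
# illustrates what "predicting future deadline misses" could mean.
def predicted_misses(queue, service_time, batch_size, now):
    """Count requests forecast to miss their deadlines.

    `queue` is a list of (arrival_time, deadline) pairs, oldest first.
    The first `batch_size` requests finish at now + service_time; later
    requests are pessimistically assumed to wait one more service round.
    """
    finish = now + service_time
    misses = 0
    for i, (_, deadline) in enumerate(queue):
        eta = finish if i < batch_size else finish + service_time
        if eta > deadline:
            misses += 1
    return misses
```

Summing such per-queue forecasts over all concurrent queues would give one plausible scalar for the selection loop sketched earlier.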
What would settle it
If, on the tested hardware, the same workloads and models yield equal or lower SLO violation ratios once the stability score or the early-exit choices are removed, the claimed performance advantage would not be supported.
Original abstract
As edge computing expands, serving multiple deep neural network (DNN) models on a single shared GPU has become a common yet challenging scenario, where each scheduling decision affects the tail latency of all concurrent queues. Existing schedulers rely on local heuristics and fail to capture this global impact, while GPU spatial-sharing approaches sacrifice latency predictability. In this paper, we propose EdgeServing, a deadline-aware multi-DNN serving system for edge devices. EdgeServing adopts time-division GPU sharing with early-exit inference for high inference predictability, and introduces a stability score to quantify how each candidate scheduling decision impacts the future queue status. At runtime, it cohesively selects the model, exit point, and batch size to minimize predicted system-wide SLO impact. Experimental results on multiple hardware platforms show that EdgeServing consistently outperforms representative baselines in both SLO violation ratio and P95 latency, enabled by the early-exit mechanism, which expands the scheduling action space under tight latency constraints.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes EdgeServing, a deadline-aware multi-DNN serving system for edge devices. It employs time-division GPU sharing combined with early-exit inference to improve latency predictability, and introduces a stability score that quantifies the predicted system-wide impact of each scheduling choice (model, exit point, batch size) on future queue status and SLO violations. At runtime, the system selects the combination that minimizes the predicted SLO impact. Experiments on multiple hardware platforms are claimed to show consistent outperformance over representative baselines in SLO violation ratio and P95 latency.
Significance. If the experimental claims hold under rigorous validation, the work could be significant for edge computing by addressing the global effects of scheduling decisions in shared-GPU multi-DNN serving, where local heuristics fall short. The early-exit mechanism's expansion of the action space under tight constraints is a practical contribution, and the stability score offers a potential way to achieve more cohesive minimization of tail latency effects.
major comments (2)
- [§5 (Evaluation)] The central claim that EdgeServing's outperformance in SLO violation ratio and P95 latency is enabled by the stability score requires direct evidence that this score accurately predicts future queue status impact. No correlation analysis, ablation isolating the score's predictive fidelity, or comparison against alternatives is described, leaving open whether decisions are driven by accurate foresight or other unstated factors.
- [Abstract and §4 (Design)] The stability score is presented as the key mechanism for cohesive minimization of predicted SLO impact under time-division sharing. However, without explicit validation (e.g., how well its predictions correlate with observed queue evolution or SLO outcomes across workloads), the attribution of performance gains to this component rather than the early-exit expansion alone cannot be confirmed.
minor comments (2)
- [Abstract] The abstract asserts 'consistent outperformance on multiple platforms' but does not define the exact baselines, workload characteristics, or statistical tests used; this should be clarified in the introduction or evaluation summary for readability.
- [§4 (Design)] Notation for the stability score and its inputs (e.g., how queue status is modeled) should be introduced earlier with a clear equation or pseudocode to aid understanding of the runtime selection logic.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for stronger validation of the stability score. We address the major comments point-by-point below and will incorporate the suggested analyses in the revised manuscript.
Point-by-point responses
- Referee: [§5 (Evaluation)] The central claim that EdgeServing's outperformance in SLO violation ratio and P95 latency is enabled by the stability score requires direct evidence that this score accurately predicts future queue status impact. No correlation analysis, ablation isolating the score's predictive fidelity, or comparison against alternatives is described, leaving open whether decisions are driven by accurate foresight or other unstated factors.
Authors: We agree that the manuscript lacks explicit correlation analysis or ablation isolating the stability score's predictive accuracy. The current §5 reports end-to-end gains but does not directly validate the score's foresight. In revision, we will add an ablation comparing full EdgeServing against a variant using the same early-exit action space but with random or local-heuristic selection. We will also include correlation plots and metrics between predicted stability scores and observed queue evolution/SLO outcomes across workloads. This will provide the requested direct evidence.
Revision: yes
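The promised validation could be as simple as a rank correlation between predicted scores and observed outcomes. A hedged sketch, assuming per-decision logs of (predicted score, observed deadline misses); the field names and the use of scipy are our assumptions, not the authors':

```python
# Sketch of the promised fidelity check: rank-correlate predicted stability
# scores with observed SLO outcomes over logged scheduling decisions.
from scipy.stats import spearmanr

def score_fidelity(decision_log):
    """decision_log: list of (predicted_score, observed_misses) pairs."""
    predicted = [p for p, _ in decision_log]
    observed = [o for _, o in decision_log]
    rho, pvalue = spearmanr(predicted, observed)
    return rho, pvalue  # rho near 1 would indicate accurate foresight
```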
- Referee: [Abstract and §4 (Design)] The stability score is presented as the key mechanism for cohesive minimization of predicted SLO impact under time-division sharing. However, without explicit validation (e.g., how well its predictions correlate with observed queue evolution or SLO outcomes across workloads), the attribution of performance gains to this component rather than the early-exit expansion alone cannot be confirmed.
Authors: We acknowledge the attribution issue. The abstract emphasizes early exits for action-space expansion under constraints, while §4 positions the stability score as the global decision mechanism. To clarify, we will revise the abstract to note both components and add a dedicated evaluation subsection with the ablations and correlation analysis described above. This will demonstrate that gains arise from the score's informed selection rather than from early exits alone.
Revision: yes
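The ablation variant the authors describe is simple to state in code: keep the early-exit action space but drop the score-driven selection. An illustrative sketch (ours, not the authors' implementation), reusing the Candidate list from the earlier sketch:

```python
# Ablation baseline: same (model, exit, batch) action space, but selection
# ignores the stability score. If full EdgeServing still wins, the gain is
# attributable to the score rather than to early exits alone.
import random

def select_action_random(queues, candidates):
    """Pick uniformly at random from the same early-exit action space."""
    return random.choice(candidates)
```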
Circularity Check
No circularity in derivation chain
Full rationale
The paper is a systems proposal for EdgeServing that introduces a stability score and uses experimental evaluation on hardware platforms to demonstrate outperformance in SLO violation ratio and P95 latency. No equations, derivations, or first-principles results are present in the provided abstract or description, so there are no load-bearing steps that could reduce by construction to fitted inputs, self-definitions, or self-citation chains. The stability score is presented as a new construct for quantifying scheduling impacts, with claims resting on empirical results rather than on renaming, smuggling via citation, or uniqueness imported from prior author work. This is the common case of an honest experimental systems paper whose claims rest on comparison against external baselines.
Reference graph
Works this paper leans on
- [1] Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo, and J. Zhang, "Edge intelligence: Paving the last mile of artificial intelligence with edge computing," Proceedings of the IEEE, vol. 107, no. 8, pp. 1738–1762, 2019.
- [2] F. Liu, G. Tang, Y. Li, Z. Cai, X. Zhang, and T. Zhou, "A survey on edge computing systems and tools," Proceedings of the IEEE, vol. 107, no. 8, pp. 1537–1562, 2019.
- [3] F. Strati, X. Ma, and A. Klimovic, "Orion: Interference-aware, fine-grained GPU sharing for ML applications," in Proceedings of the Nineteenth European Conference on Computer Systems (EuroSys '24), 2024, pp. 1075–1092.
- [4] K. K. W. Ng, H. M. Demoulin, and V. Liu, "Paella: Low-latency model serving with software-defined GPU scheduling," in Proceedings of the 29th Symposium on Operating Systems Principles (SOSP '23), 2023, pp. 595–610. https://doi.org/10.1145/3600006.3613163
- [5] A. Gujarati, R. Karimi, S. Alzayat, W. Hao, A. Kaufmann, Y. Vigfusson, and J. Mace, "Serving DNNs like clockwork: Performance predictability from the bottom up," in 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), Nov. 2020, pp. 443–462. https://www.usenix.org/conference/osdi20/presen...
- [6] F. Romero, Q. Li, N. J. Yadwadkar, and C. Kozyrakis, "INFaaS: Automated model-less inference serving," in 2021 USENIX Annual Technical Conference (USENIX ATC 21), Jul. 2021, pp. 397–411. https://www.usenix.org/conference/atc21/presentation/romero
- [7] L. Chen, W. Deng, A. Canumalla, Y. Xin, D. Zhuo, M. Philipose, and A. Krishnamurthy, "Symphony: Optimized DNN model serving using deferred batch scheduling," 2024. https://arxiv.org/abs/2308.07470
- [8] Y. Kaya, S. Hong, and T. Dumitras, "Shallow-deep networks: Understanding and mitigating network overthinking," 2019. https://arxiv.org/abs/1810.07052
- [9] S. Laskaridis, S. I. Venieris, M. Almeida, I. Leontiadis, and N. D. Lane, "SPiNN: Synergistic progressive inference of neural networks over device and cloud," in Proceedings of the 26th Annual International Conference on Mobile Computing and Networking (MobiCom '20), 2020.
- [10] T. Tambe, C. Hooper, L. Pentecost, T. Jia, E.-Y. Yang, M. Donato, V. Sanh, P. N. Whatmough, A. M. Rush, D. Brooks, and G.-Y. Wei, "EdgeBERT: Sentence-level energy optimizations for latency-aware multi-task NLP inference," 2021. https://arxiv.org/abs/2011.14203
- [11] S. Teerapittayanon, B. McDanel, and H. T. Kung, "BranchyNet: Fast inference via early exiting from deep neural networks," 2017. https://arxiv.org/abs/1709.01686
- [12] Y. Dai, R. Pan, A. Iyer, K. Li, and R. Netravali, "Apparate: Rethinking early exits to tame latency-throughput tensions in ML serving," in Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, 2024, pp. 607–623.
- [13] W. Zhou, C. Xu, T. Ge, J. McAuley, K. Xu, and F. Wei, "BERT loses patience: Fast and robust inference with early exit," 2020. https://arxiv.org/abs/2006.04152
- [14] S. Tang, Y. Wang, Z. Kong, T. Zhang, Y. Li, C. Ding, Y. Wang, Y. Liang, and D. Xu, "You need multiple exiting: Dynamic early exiting for accelerating unified vision language model," 2023. https://arxiv.org/abs/2211.11152
- [15] NVIDIA, "NVIDIA Multi-Process Service (MPS)," https://docs.nvidia.com/deploy/mps/index.html, 2024.
- [16] M. Han, R. Chen, W. Shen, H. Zhang, J. Yang, and H. Chen, "Real-time, work-conserving GPU scheduling for concurrent DNN inference," ACM Trans. Comput. Syst., vol. 44, no. 1, Nov. 2025. https://doi.org/10.1145/3768622
- [17] D. Crankshaw, X. Wang, G. Zhou, M. J. Franklin, J. E. Gonzalez, and I. Stoica, "Clipper: A low-latency online prediction serving system," in 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), 2017, pp. 613–627.
- [18] C. Olston, N. Fiedel, K. Gorovoy, J. Harmsen, L. Lao, F. Li, V. Rajashekhar, S. Ramesh, and J. Soyke, "TensorFlow-Serving: Flexible, high-performance ML serving," 2017. https://arxiv.org/abs/1712.06139
- [19] NVIDIA, "NVIDIA Triton Inference Server," https://developer.nvidia.com/triton-inference-server, 2024.
- [20] S. Ahmad, H. Guan, B. D. Friedman, T. Williams, R. K. Sitaraman, and T. Woo, "Proteus: A high-throughput inference-serving system with accuracy scaling," in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '24), Volume 1, 2024, pp. 318–334.
- [21] J. R. Gunasekaran, C. S. Mishra, P. Thinakaran, B. Sharma, M. T. Kandemir, and C. R. Das, "Cocktail: A multidimensional optimization for model serving in cloud," in 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), Apr. 2022, pp. 1041–1057. https://www.usenix.org/conference...