EdgeServing: Deadline-Aware Multi-DNN Serving at the Edge
Pith reviewed 2026-05-08 05:42 UTC · model grok-4.3
The pith
EdgeServing schedules multiple DNNs on a shared edge GPU by jointly picking the model, exit point, and batch size, guided by a stability score, to cut deadline violations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EdgeServing shows that early-exit inference combined with a stability score lets the scheduler choose, at runtime, the model, exit point, and batch size that together minimize the forecasted SLO violations across all concurrent queues. On multiple hardware platforms the resulting system records lower SLO violation ratios and better P95 latencies than representative baselines, with the gains attributed to the expanded action space early exits provide under tight constraints.
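Read concretely, the claimed runtime decision is an argmin over a joint action space. The sketch below is ours, not the paper's: the names Candidate and stability_score are illustrative, the score itself is left abstract, and the example action space is assumed.

```python
# Illustrative sketch of the selection described above; all names are ours.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Candidate:
    model: str        # which DNN variant to dispatch
    exit_point: int   # early-exit index (smaller = earlier, cheaper)
    batch_size: int   # how many queued requests to serve together

def select_action(queues, candidates, stability_score):
    # stability_score(c, queues) is assumed to return a scalar where lower
    # means fewer forecasted deadline misses across all concurrent queues.
    return min(candidates, key=lambda c: stability_score(c, queues))

# Hypothetical action space: 2 models x 3 exit points x 3 batch sizes.
candidates = [Candidate(m, e, b) for m, e, b in
              product(["resnet50", "mobilenetv2"], [0, 1, 2], [1, 4, 8])]
```

The point of the expanded action space is that early exits add cheap, lower-latency candidates that a model-and-batch-only scheduler would not have under tight deadlines.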
What carries the argument
A stability score that quantifies the future impact of each scheduling decision on queue status, used together with early-exit points to expand the space of feasible inference actions.
Load-bearing premise
The stability score accurately predicts how each choice will change future queue lengths and deadline misses, and early-exit points keep model accuracy high enough for the target applications.
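Read operationally, the premise requires a forecast of queue evolution under each candidate action. A toy version of such a forecast follows, under the simplifying assumption (ours, not the paper's) that profiled service times are deterministic:

```python
# Toy forecast of deadline misses for one queue after taking an action.
# The paper's actual stability score is not specified here; this only
# illustrates what "predicting future deadline misses" could mean.
def predicted_misses(queue, service_time, batch_size, now):
    """Count requests forecast to miss their deadlines.

    `queue` is a list of (arrival_time, deadline) pairs, oldest first.
    The first `batch_size` requests finish at now + service_time; later
    requests are pessimistically assumed to wait one more service round.
    """
    finish = now + service_time
    misses = 0
    for i, (_, deadline) in enumerate(queue):
        eta = finish if i < batch_size else finish + service_time
        if eta > deadline:
            misses += 1
    return misses
```

Summing such per-queue forecasts over all concurrent queues would give one plausible scalar for the selection loop sketched earlier.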
What would settle it
If, on the tested hardware, the same workloads and models yield equal or lower SLO violation ratios once the stability score or the early-exit choices are removed, the claimed performance advantage would not be supported.
Original abstract
As edge computing expands, serving multiple deep neural network (DNN) models on a single shared GPU has become a common yet challenging scenario, where each scheduling decision affects the tail latency of all concurrent queues. Existing schedulers rely on local heuristics and fail to capture this global impact, while GPU spatial-sharing approaches sacrifice latency predictability. In this paper, we propose EdgeServing, a deadline-aware multi-DNN serving system for edge devices. EdgeServing adopts time-division GPU sharing with early-exit inference for high inference predictability, and introduces a stability score to quantify how each candidate scheduling decision impacts the future queue status. At runtime, it cohesively selects the model, exit point, and batch size to minimize predicted system-wide SLO impact. Experimental results on multiple hardware platforms show that EdgeServing consistently outperforms representative baselines in both SLO violation ratio and P95 latency, enabled by the early-exit mechanism, which expands the scheduling action space under tight latency constraints.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes EdgeServing, a deadline-aware multi-DNN serving system for edge devices. It employs time-division GPU sharing combined with early-exit inference to improve latency predictability, and introduces a stability score that quantifies the predicted system-wide impact of each scheduling choice (model, exit point, batch size) on future queue status and SLO violations. At runtime, the system selects the combination that minimizes the predicted SLO impact. Experiments on multiple hardware platforms are claimed to show consistent outperformance over representative baselines in SLO violation ratio and P95 latency.
Significance. If the experimental claims hold under rigorous validation, the work could be significant for edge computing by addressing the global effects of scheduling decisions in shared-GPU multi-DNN serving, where local heuristics fall short. The early-exit mechanism's expansion of the action space under tight constraints is a practical contribution, and the stability score offers a potential way to achieve more cohesive minimization of tail latency effects.
major comments (2)
- [§5 (Evaluation)] The central claim that EdgeServing's outperformance in SLO violation ratio and P95 latency is enabled by the stability score requires direct evidence that this score accurately predicts future queue status impact. No correlation analysis, ablation isolating the score's predictive fidelity, or comparison against alternatives is described, leaving open whether decisions are driven by accurate foresight or other unstated factors.
- [Abstract and §4 (Design)] The stability score is presented as the key mechanism for cohesive minimization of predicted SLO impact under time-division sharing. However, without explicit validation (e.g., how well its predictions correlate with observed queue evolution or SLO outcomes across workloads), the attribution of performance gains to this component rather than the early-exit expansion alone cannot be confirmed.
minor comments (2)
- [Abstract] The abstract asserts 'consistent outperformance on multiple platforms' but does not define the exact baselines, workload characteristics, or statistical tests used; this should be clarified in the introduction or evaluation summary for readability.
- [§4 (Design)] Notation for the stability score and its inputs (e.g., how queue status is modeled) should be introduced earlier with a clear equation or pseudocode to aid understanding of the runtime selection logic.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for stronger validation of the stability score. We address the major comments point-by-point below and will incorporate the suggested analyses in the revised manuscript.
Point-by-point responses
- Referee: [§5 (Evaluation)] The central claim that EdgeServing's outperformance in SLO violation ratio and P95 latency is enabled by the stability score requires direct evidence that this score accurately predicts future queue status impact. No correlation analysis, ablation isolating the score's predictive fidelity, or comparison against alternatives is described, leaving open whether decisions are driven by accurate foresight or other unstated factors.
Authors: We agree that the manuscript lacks explicit correlation analysis or ablation isolating the stability score's predictive accuracy. The current §5 reports end-to-end gains but does not directly validate the score's foresight. In revision, we will add an ablation comparing full EdgeServing against a variant using the same early-exit action space but with random or local-heuristic selection. We will also include correlation plots and metrics between predicted stability scores and observed queue evolution/SLO outcomes across workloads. This will provide the requested direct evidence.
Revision: yes
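The promised validation could be as simple as a rank correlation between predicted scores and observed outcomes. A hedged sketch, assuming per-decision logs of (predicted score, observed deadline misses); the field names and the use of scipy are our assumptions, not the authors':

```python
# Sketch of the promised fidelity check: rank-correlate predicted stability
# scores with observed SLO outcomes over logged scheduling decisions.
from scipy.stats import spearmanr

def score_fidelity(decision_log):
    """decision_log: list of (predicted_score, observed_misses) pairs."""
    predicted = [p for p, _ in decision_log]
    observed = [o for _, o in decision_log]
    rho, pvalue = spearmanr(predicted, observed)
    return rho, pvalue  # rho near 1 would indicate accurate foresight
```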
- Referee: [Abstract and §4 (Design)] The stability score is presented as the key mechanism for cohesive minimization of predicted SLO impact under time-division sharing. However, without explicit validation (e.g., how well its predictions correlate with observed queue evolution or SLO outcomes across workloads), the attribution of performance gains to this component rather than the early-exit expansion alone cannot be confirmed.
Authors: We acknowledge the attribution issue. The abstract emphasizes early exits for action-space expansion under constraints, while §4 positions the stability score as the global decision mechanism. To clarify, we will revise the abstract to note both components and add a dedicated evaluation subsection with the ablations and correlation analysis described above. This will demonstrate that gains arise from the score's informed selection rather than from early exits alone.
Revision: yes
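The ablation variant the authors describe is simple to state in code: keep the early-exit action space but drop the score-driven selection. An illustrative sketch (ours, not the authors' implementation), reusing the Candidate list from the earlier sketch:

```python
# Ablation baseline: same (model, exit, batch) action space, but selection
# ignores the stability score. If full EdgeServing still wins, the gain is
# attributable to the score rather than to early exits alone.
import random

def select_action_random(queues, candidates):
    """Pick uniformly at random from the same early-exit action space."""
    return random.choice(candidates)
```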
Circularity Check
No circularity in derivation chain
Full rationale
The paper is a systems proposal for EdgeServing that introduces a stability score and uses experimental evaluation on hardware platforms to demonstrate outperformance in SLO violation ratio and P95 latency. No equations, derivations, or first-principles results are present in the provided abstract or description, so there are no load-bearing steps that could reduce by construction to fitted inputs, self-definitions, or self-citation chains. The stability score is presented as a new construct for quantifying scheduling impacts, with claims resting on empirical results rather than on renaming, smuggling via citation, or uniqueness imported from prior author work. This is the common case of an honest experimental systems paper whose claims rest on comparison against external baselines.
Reference graph
Works this paper leans on
- [1] Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo, and J. Zhang, "Edge intelligence: Paving the last mile of artificial intelligence with edge computing," Proceedings of the IEEE, vol. 107, no. 8, pp. 1738–1762, 2019.
- [2] F. Liu, G. Tang, Y. Li, Z. Cai, X. Zhang, and T. Zhou, "A survey on edge computing systems and tools," Proceedings of the IEEE, vol. 107, no. 8, pp. 1537–1562, 2019.
- [3] F. Strati, X. Ma, and A. Klimovic, "Orion: Interference-aware, fine-grained GPU sharing for ML applications," in Proceedings of the Nineteenth European Conference on Computer Systems (EuroSys '24), 2024, pp. 1075–1092.
- [4] K. K. W. Ng, H. M. Demoulin, and V. Liu, "Paella: Low-latency model serving with software-defined GPU scheduling," in Proceedings of the 29th Symposium on Operating Systems Principles (SOSP '23), 2023, pp. 595–610. https://doi.org/10.1145/3600006.3613163
- [5] A. Gujarati, R. Karimi, S. Alzayat, W. Hao, A. Kaufmann, Y. Vigfusson, and J. Mace, "Serving DNNs like clockwork: Performance predictability from the bottom up," in 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), Nov. 2020, pp. 443–462. https://www.usenix.org/conference/osdi20/presen...
- [6] F. Romero, Q. Li, N. J. Yadwadkar, and C. Kozyrakis, "INFaaS: Automated model-less inference serving," in 2021 USENIX Annual Technical Conference (USENIX ATC 21), Jul. 2021, pp. 397–411. https://www.usenix.org/conference/atc21/presentation/romero
- [7] L. Chen, W. Deng, A. Canumalla, Y. Xin, D. Zhuo, M. Philipose, and A. Krishnamurthy, "Symphony: Optimized DNN model serving using deferred batch scheduling," 2024. https://arxiv.org/abs/2308.07470
- [8] Y. Kaya, S. Hong, and T. Dumitras, "Shallow-deep networks: Understanding and mitigating network overthinking," 2019. https://arxiv.org/abs/1810.07052
- [9] S. Laskaridis, S. I. Venieris, M. Almeida, I. Leontiadis, and N. D. Lane, "SPiNN: Synergistic progressive inference of neural networks over device and cloud," in Proceedings of the 26th Annual International Conference on Mobile Computing and Networking (MobiCom '20), 2020.
- [10] T. Tambe, C. Hooper, L. Pentecost, T. Jia, E.-Y. Yang, M. Donato, V. Sanh, P. N. Whatmough, A. M. Rush, D. Brooks, and G.-Y. Wei, "EdgeBERT: Sentence-level energy optimizations for latency-aware multi-task NLP inference," 2021. https://arxiv.org/abs/2011.14203
- [11] S. Teerapittayanon, B. McDanel, and H. T. Kung, "BranchyNet: Fast inference via early exiting from deep neural networks," 2017. https://arxiv.org/abs/1709.01686
- [12] Y. Dai, R. Pan, A. Iyer, K. Li, and R. Netravali, "Apparate: Rethinking early exits to tame latency-throughput tensions in ML serving," in Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, 2024, pp. 607–623.
- [13] W. Zhou, C. Xu, T. Ge, J. McAuley, K. Xu, and F. Wei, "BERT loses patience: Fast and robust inference with early exit," 2020. https://arxiv.org/abs/2006.04152
- [14] S. Tang, Y. Wang, Z. Kong, T. Zhang, Y. Li, C. Ding, Y. Wang, Y. Liang, and D. Xu, "You need multiple exiting: Dynamic early exiting for accelerating unified vision language model," 2023. https://arxiv.org/abs/2211.11152
- [15] NVIDIA, "NVIDIA Multi-Process Service (MPS)," https://docs.nvidia.com/deploy/mps/index.html, 2024.
- [16] M. Han, R. Chen, W. Shen, H. Zhang, J. Yang, and H. Chen, "Real-time, work-conserving GPU scheduling for concurrent DNN inference," ACM Trans. Comput. Syst., vol. 44, no. 1, Nov. 2025. https://doi.org/10.1145/3768622
- [17] D. Crankshaw, X. Wang, G. Zhou, M. J. Franklin, J. E. Gonzalez, and I. Stoica, "Clipper: A low-latency online prediction serving system," in 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), 2017, pp. 613–627.
- [18] C. Olston, N. Fiedel, K. Gorovoy, J. Harmsen, L. Lao, F. Li, V. Rajashekhar, S. Ramesh, and J. Soyke, "TensorFlow-Serving: Flexible, high-performance ML serving," 2017. https://arxiv.org/abs/1712.06139
- [19] NVIDIA, "NVIDIA Triton Inference Server," https://developer.nvidia.com/triton-inference-server, 2024.
- [20] S. Ahmad, H. Guan, B. D. Friedman, T. Williams, R. K. Sitaraman, and T. Woo, "Proteus: A high-throughput inference-serving system with accuracy scaling," in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '24), Volume 1, 2024, pp. 318–334.
- [21] J. R. Gunasekaran, C. S. Mishra, P. Thinakaran, B. Sharma, M. T. Kandemir, and C. R. Das, "Cocktail: A multidimensional optimization for model serving in cloud," in 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), Apr. 2022, pp. 1041–1057. https://www.usenix.org/conference...