Risk-Aware and Stable Edge Server Selection Under Network Latency SLOs

Arnova Abdullah; Eldiyar Zhantileuov; Mohan Liyanage; Rolf Schuster

arxiv: 2604.21483 · v1 · submitted 2026-04-23 · 💻 cs.DC · cs.NI

Pith reviewed 2026-05-08 13:57 UTC · model grok-4.3

keywords: edge computing, server selection, latency SLO, risk estimation, hysteresis, network latency prediction, switching stability, Cantelli bound

The pith

A framework that scores edge servers by latency risk and stability cuts missed deadlines and switches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a lightweight method for choosing among edge servers when applications must meet strict latency targets. Instead of picking the server with the lowest average predicted latency, the method also estimates the risk that latency will exceed the target by combining mean and uncertainty predictions with a Normal approximation and a Cantelli bound on the tail. It further applies percentile scoring and hysteresis to avoid rapid switches when network conditions change briefly. On a real multi-server edge testbed with a 0.5-second SLO, the approach lowered the rate of missed deadlines from 39 percent to 34 percent and reduced switching frequency from 46 percent to 5.5 percent while keeping average latency at 0.45 seconds.

Core claim

By characterising each server with predictive mean and uncertainty summaries of network latency, estimating SLO-violation risk via a tight Normal approximation and conservative Cantelli bound, and stabilising choices with percentile-based scoring plus hysteresis, the framework reduces deadline-miss rate from 39 percent to 34 percent, switching frequency from 46 percent to 5.5 percent, and maintains sub-SLO average latency of approximately 0.45 seconds on a multi-server edge testbed under a 0.5-second SLO.

What carries the argument

Risk evaluation that combines latency mean and uncertainty summaries with a Normal approximation and Cantelli bound, paired with percentile scoring and hysteresis to stabilise selections.
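The two risk estimates named above can be sketched in a few lines. This is a minimal illustration using the textbook forms of the Normal tail probability and Cantelli's one-sided inequality; the function names and the example values (mean 0.40 s, standard deviation 0.05 s) are assumptions for illustration, and the way the paper combines the two estimates into a final score is not shown here.

```python
import math


def normal_tail_risk(mu: float, sigma: float, tau: float) -> float:
    """P(L > tau) if latency L were Normal(mu, sigma^2)."""
    if sigma <= 0:
        # Degenerate case: no uncertainty, so the outcome is deterministic.
        return 1.0 if mu > tau else 0.0
    z = (tau - mu) / sigma
    # Upper tail of the standard Normal: 1 - Phi(z) = 0.5 * erfc(z / sqrt(2)).
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def cantelli_bound(mu: float, sigma: float, tau: float) -> float:
    """Distribution-free upper bound on P(L >= tau) for finite variance.

    Only informative when tau > mu; otherwise the trivial bound 1.0 is returned.
    """
    gap = tau - mu
    if gap <= 0:
        return 1.0
    var = sigma * sigma
    return var / (var + gap * gap)


if __name__ == "__main__":
    mu, sigma, tau = 0.40, 0.05, 0.5  # hypothetical server summary, SLO from the paper
    print(f"Normal estimate: {normal_tail_risk(mu, sigma, tau):.4f}")
    print(f"Cantelli bound:  {cantelli_bound(mu, sigma, tau):.4f}")
```

For these assumed inputs the SLO sits two standard deviations above the mean, so the Normal estimate is about 0.023 while the Cantelli bound is 0.2. The gap shows why one is described as tight and the other as conservative.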

If this is right

Explicit tail-risk scoring improves adherence to strict latency SLOs compared with mean-only selection.
Hysteresis control suppresses unnecessary server switches caused by short-lived network fluctuations.
The method keeps average latency safely below the SLO while improving both reliability and decision stability.
The framework remains lightweight and interpretable for practical deployment in dynamic edge settings.
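The stability claim can be made concrete with a small sketch of percentile scoring plus hysteresis. Everything here is an assumption for illustration: the score `mu + z * sigma`, the multiplier `z`, the switching `margin`, and the server names are hypothetical, and the paper's actual rule may differ.

```python
from typing import Dict, Optional, Tuple


def percentile_score(mu: float, sigma: float, z: float = 1.645) -> float:
    """Approximate upper latency percentile under a Normal assumption.

    z = 1.645 corresponds roughly to the 95th percentile (assumed value).
    """
    return mu + z * sigma


def select_with_hysteresis(
    current: Optional[str],
    predictions: Dict[str, Tuple[float, float]],
    margin: float = 0.02,
    z: float = 1.645,
) -> str:
    """Pick a server, switching only when the gain exceeds `margin` seconds.

    `predictions` maps server name -> (predicted mean, predicted std dev).
    """
    scores = {s: percentile_score(mu, sd, z) for s, (mu, sd) in predictions.items()}
    best = min(scores, key=scores.get)
    if current is None or current not in scores:
        return best
    # Hysteresis: keep the incumbent unless the challenger is clearly better.
    if scores[current] - scores[best] > margin:
        return best
    return current


if __name__ == "__main__":
    preds = {"edge-a": (0.40, 0.02), "edge-b": (0.39, 0.02)}
    print(select_with_hysteresis("edge-a", preds))  # small gain: no switch
    preds["edge-b"] = (0.35, 0.02)
    print(select_with_hysteresis("edge-a", preds))  # large gain: switch
```

A larger margin suppresses more switches but delays reaction to a sustained degradation, which is the trade-off any such threshold has to balance.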

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Better uncertainty estimates from improved predictors would tighten the risk bounds and potentially allow even stricter SLOs.
The stability mechanism could extend to other selection problems facing noisy observations, such as wireless channel or cloud instance choice.
Measuring end-to-end application metrics beyond raw latency would test whether the observed reductions translate to better user experience.
The modest drop in misses could matter in high-volume services where each violation affects many concurrent users.

Load-bearing premise

Predictive mean and uncertainty summaries of network latency are sufficiently accurate and the Normal approximation plus Cantelli bound reliably estimates the true tail risk of SLO violations.

What would settle it

Running the same multi-server edge testbed experiments and finding that the deadline-miss rate stays at or above 39 percent or that switching frequency remains near 46 percent would show the risk and stability components provide no measurable benefit.

Figures

Figures reproduced from arXiv: 2604.21483 by Arnova Abdullah, Eldiyar Zhantileuov, Mohan Liyanage, Rolf Schuster.

**Figure 1.** Interpretation of the risk-aversion parameter

**Figure 2.** Server selection over time under Algorithm 1. The selected server …

**Figure 3.** Server selection over time under Algorithm 2. The hysteresis layer …

**Figure 4.** Selected server index over time under Algorithm 2 in the containerlab …

Original abstract

We present a lightweight and interpretable decision framework for dynamic edge server selection in latency-critical applications that explicitly accounts for tail risk and switching stability. Each candidate server is characterised by predictive mean and uncertainty summaries of network latency, which are used to estimate the risk of service-level objective (SLO) violations and to guide selection. Risk is evaluated using a tight Normal approximation complemented by a conservative Cantelli bound, while percentile-based scoring coupled with hysteresis stabilizes decisions and suppresses oscillatory switching under short-lived network fluctuations. Experimental results on a multi-server edge testbed with a strict SLO of τ = 0.5 s show that the proposed approach reduces the deadline-miss rate from 39% to 34% compared to a mean-only baseline, while reducing switching frequency from 46% to 5.5% (≈88% reduction) and maintaining sub-SLO average latency (≈0.45 s). These results demonstrate that explicit risk evaluation combined with stability-preserving control enables practical and robust adaptive server selection in dynamic edge environments.
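The relative reductions implied by the reported figures can be checked directly. The switching figure reproduces the abstract's ≈88%; the relative miss-rate figure is derived here from the reported 39% and 34% and is not stated in the abstract.

```latex
\frac{46 - 5.5}{46} = \frac{40.5}{46} \approx 0.880,
\qquad
\frac{39 - 34}{39} = \frac{5}{39} \approx 0.128 .
```

So the switching frequency falls by about 88% in relative terms, while the deadline-miss rate falls by 5 percentage points, or about 13% in relative terms.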

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper pairs a Normal-plus-Cantelli risk score with hysteresis to cut server switching sharply in edge selection, but the modest miss-rate gain may come more from stability control than from reliable tail-risk ranking.

The one thing to know is that this work adds a risk term to server selection under a 0.5 s SLO and pairs it with hysteresis to limit switching. On their testbed the deadline-miss rate falls from 39 % to 34 % while switching drops from 46 % to 5.5 %, and average latency stays under the target. The risk score itself uses predictive mean and uncertainty fed into a Normal model plus a Cantelli bound; the stability piece is percentile-based and keeps decisions from flipping on short spikes.

Referee Report

1 major / 1 minor

Summary. The paper presents a lightweight, interpretable framework for dynamic edge server selection under strict network latency SLOs. Candidate servers are scored using predictive mean and uncertainty summaries of latency; risk of SLO violation is estimated via a Normal approximation augmented by a Cantelli bound, while percentile-based scoring and hysteresis are used to suppress switching under transient fluctuations. On a multi-server edge testbed with τ = 0.5 s, the approach is reported to reduce deadline-miss rate from 39 % to 34 % versus a mean-only baseline, cut switching frequency from 46 % to 5.5 %, and maintain sub-SLO average latency.

Significance. If the empirical gains are reproducible and attributable to the risk term, the work supplies a practical, low-overhead method for stable, risk-aware server selection in latency-critical edge applications. The combination of explicit tail-risk estimation with hysteresis addresses both performance and control-stability concerns that are central to real deployments.

major comments (1)

The headline experimental claim (deadline-miss reduction from 39 % to 34 % and switching reduction from 46 % to 5.5 %) rests on the risk scores correctly ranking servers by true P(latency > 0.5 s). The framework employs a Normal model plus Cantelli one-sided bound for this probability; however, network latencies are frequently heavy-tailed. Without a direct comparison of the estimated risk scores against empirical violation frequencies on the collected testbed traces, it remains possible that the observed improvements are driven primarily by the hysteresis component rather than by accurate tail-risk differentiation.

minor comments (1)

The abstract states that risk is evaluated using 'a tight Normal approximation complemented by a conservative Cantelli bound'; an explicit equation or pseudocode for the final risk score (mean, uncertainty, and bound combination) would improve reproducibility and allow readers to assess tightness.
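For orientation, the two ingredients have the following standard textbook forms, written with the predictive mean μ, predictive standard deviation σ, and SLO threshold τ. This is a sketch of the generic formulas only; the symbol for the Normal estimate is introduced here, and how the paper combines the two into a final score is not given in the abstract.

```latex
\hat{p}_{\mathrm{N}} = 1 - \Phi\!\left(\frac{\tau - \mu}{\sigma}\right),
\qquad
\Pr(X \ge \tau) \;\le\; \frac{\sigma^{2}}{\sigma^{2} + (\tau - \mu)^{2}}
\quad \text{for } \tau > \mu ,
```

where X is the latency, Φ is the standard Normal cumulative distribution function, and the right-hand inequality is Cantelli's one-sided bound, valid for any distribution with mean μ and finite variance σ².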

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and describe the revisions we will make to strengthen the paper.

Referee: The headline experimental claim (deadline-miss reduction from 39 % to 34 % and switching reduction from 46 % to 5.5 %) rests on the risk scores correctly ranking servers by true P(latency > 0.5 s). The framework employs a Normal model plus Cantelli one-sided bound for this probability; however, network latencies are frequently heavy-tailed. Without a direct comparison of the estimated risk scores against empirical violation frequencies on the collected testbed traces, it remains possible that the observed improvements are driven primarily by the hysteresis component rather than by accurate tail-risk differentiation.

Authors: We agree that a direct empirical validation of the risk scores would strengthen attribution of the gains to the risk term. Our risk estimator combines a Normal approximation with the Cantelli bound; the latter is a distribution-free one-sided inequality that holds for any distribution with finite variance and therefore remains valid (albeit conservative) even when latencies exhibit heavier tails than the Normal. This provides a principled, interpretable upper bound on SLO-violation probability without requiring parametric tail assumptions. Nevertheless, to address the referee's concern, we will add to the revised manuscript a calibration analysis that directly compares the estimated risk scores against the empirical frequencies of latency > 0.5 s observed on the collected testbed traces. This will include quantitative metrics (e.g., correlation or calibration plots) and will help isolate the contribution of the risk component from the hysteresis mechanism. We believe the addition will confirm that the risk-aware ranking meaningfully improves server selection beyond stability control alone.

Revision: yes
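A calibration check of the kind discussed in this exchange could look roughly like the sketch below. It bins the estimated risks and compares each bin's mean estimate with the observed fraction of latencies above the SLO. The function name, the equal-width binning, and the toy data are assumptions for illustration, not the authors' analysis.

```python
from typing import List, Sequence, Tuple


def calibration_table(
    predicted_risk: Sequence[float],
    latencies: Sequence[float],
    tau: float = 0.5,
    n_bins: int = 5,
) -> List[Tuple[float, float, int]]:
    """Return (mean predicted risk, empirical violation rate, count) per non-empty bin."""
    if len(predicted_risk) != len(latencies):
        raise ValueError("inputs must have equal length")
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for p, lat in zip(predicted_risk, latencies):
        # Equal-width bins on [0, 1]; clamp so out-of-range values land in an edge bin.
        idx = min(max(int(p * n_bins), 0), n_bins - 1)
        bins[idx].append((p, lat > tau))
    rows: List[Tuple[float, float, int]] = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            rate = sum(1 for _, violated in b if violated) / len(b)
            rows.append((mean_p, rate, len(b)))
    return rows


if __name__ == "__main__":
    # Toy data (hypothetical): estimated risks and the latencies later observed.
    risks = [0.1, 0.1, 0.9, 0.9]
    observed = [0.3, 0.6, 0.7, 0.8]
    for mean_p, rate, n in calibration_table(risks, observed, tau=0.5, n_bins=2):
        print(f"predicted={mean_p:.2f} empirical={rate:.2f} n={n}")
```

Well-calibrated scores would show the empirical rate tracking the mean predicted risk across bins; a conservative bound such as Cantelli's would be expected to sit above the empirical rate.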

Circularity Check

0 steps flagged

No circularity: framework and results are independently defined and externally validated

The paper defines its risk evaluation using predictive mean/uncertainty summaries, a Normal approximation, and Cantelli bound as an explicit modeling choice, then evaluates the resulting policy via independent multi-server testbed experiments that measure deadline-miss rates and switching frequencies against a mean-only baseline. No equation or claim reduces to a fitted parameter renamed as a prediction, no self-citation is load-bearing for the core derivation, and the experimental outcomes are not forced by the model inputs. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review limits visibility into parameters and assumptions; the framework implicitly relies on a normality assumption for latency and on the accuracy of uncertainty estimates, but no explicit free parameters or invented entities are stated.

axioms (1)

Domain assumption: Network latency can be summarized by predictive mean and uncertainty that support a tight Normal approximation for risk estimation. Invoked in the description of risk evaluation using Normal approximation and Cantelli bound.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

[1] M. Liyanage, E. Zhantileuov, A. K. Idrees, and R. Schuster, “Lightweight latency prediction scheme for edge applications: A rational modelling approach,” in 2025 5th International Conference on Computer Systems (ICCS), 2025, pp. 115–119.

[2] J. S. Burbano, A. Abdullah, E. Zhantileuov, M. Liyanage, and R. Schuster, “Dynamic edge server selection in time-varying environments: A reliability-aware predictive approach,” 2025. [Online]. Available: https://arxiv.org/abs/2511.10146

[3] M. Harchol-Balter, Performance Modeling and Design of Computer Systems: Queueing Theory in Action. Cambridge University Press, 2013.

[4] G. Grimmett and D. Stirzaker, Probability and Random Processes. Oxford University Press, 2020.

[5] M. Mitzenmacher and E. Upfal, Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis. Cambridge University Press, 2017.

[6] G. C. Buttazzo, Hard Real-Time Computing Systems: Predictable Scheduling Algorithms and Applications. Springer, 1997.

[7] C. Stewart and K. Shen, “Performance modeling and system management for multi-component online services,” in Proceedings of the 2nd Conference on Symposium on Networked Systems Design & Implementation - Volume 2, 2005, pp. 71–84.

[8] K. Aljobory and M. A. Yazici, “A comparative study for server selection schemes in multiserver mobile edge computing,” in 9th International Conference on Fog and Mobile Edge Computing, 2024, pp. 38–45.

[9] P. Kang, “SLO-aware resource management for edge computing application,” Ph.D. dissertation, The University of Texas at San Antonio, 2024.

[10] C. Zhang, Y. Deng, H. Zhao, T. Chen, and S. Deng, “Tail-learning: Adaptive learning method for mitigating tail latency in autonomous edge systems,” ACM Transactions on Autonomous and Adaptive Systems, vol. 20, no. 4, pp. 1–25, 2025.

[11] K. Ma, Y. Sun, S. Hua, M. A. Imran, and W. Saad, “A unified learning-based optimization framework for 0-1 mixed problems in wireless networks,” IEEE Transactions on Communications, pp. 1–1, 2025.

[12] B. Beyer, C. Jones, J. Petoff et al., Site reliability engineering: how Google runs production systems. O’Reilly Media, Inc., 2016.
