Predictive Bayesian Arbitration: A Scalable Noisy-OR Model with Service Criticality Awareness

Anil Jangam; Ganesh Karthick Rajendran; Roy Kantharajah

arxiv: 2604.11989 · v1 · submitted 2026-04-13 · 💻 cs.DC

Pith reviewed 2026-05-10 15:19 UTC · model grok-4.3

keywords: predictive arbitration · Bayesian Noisy-OR · high availability · failure cascade · microservices · switchover efficiency · geo-HA clusters

The pith

A Bayesian Noisy-OR model learns failure cascades to cut detection time by 60% and enable proactive switchovers in Geo-HA clusters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a shared microservice arbitration framework for geographically high-available clusters that shifts from reactive heartbeat monitoring to predictive decisions. It relies on an adaptive Bayesian Noisy-OR model that discovers temporal dependencies in emergent failure patterns and refines expert priors automatically during operation. This setup reduces the infrastructure needed for arbitration while allowing switchovers to start before hard failures occur. The approach remains relevant because modern cloud systems depend on minimizing unplanned downtime across distributed services without adding per-domain overhead.

Core claim

The authors show that a Bayesian Noisy-OR model, initialized with expert priors and updated online, can autonomously identify cascade dependencies among microservice failures. When applied to Geo-HA arbitration, this produces a predictive lead time that yields a 60% reduction in mean time to failure detection and up to 77.8% higher switchover efficiency than traditional reactive methods, all while preserving O(n) computational complexity.

What carries the argument

The Bayesian Noisy-OR model computes the probability of an observed failure as a noisy logical disjunction over multiple independent causes. This exposes temporal cascade structure and supports runtime refinement of the priors.
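
To make the noisy disjunction concrete, here is a minimal sketch of the textbook Noisy-OR formula, P(failure) = 1 - (1 - leak) · Π(1 - p_i) over the active causes. The function name `noisy_or`, the `leak` parameter, and the example strengths are illustrative assumptions and are not taken from the paper.

```python
from math import prod


def noisy_or(active_strengths, leak=0.0):
    """Probability of the effect under a standard Noisy-OR gate.

    active_strengths: causal strengths p_i in [0, 1] for the causes that are
        currently active (hypothetical values, not the paper's parameters).
    leak: probability of the effect when no modelled cause is active.
    """
    # The effect is absent only if the leak and every active cause
    # independently fail to trigger it.
    p_no_effect = (1.0 - leak) * prod(1.0 - p for p in active_strengths)
    return 1.0 - p_no_effect


# Two active upstream failures, each with an assumed strength of 0.5.
p_failure = noisy_or([0.5, 0.5])
```

The computation is a single pass over the active causes, so its cost grows linearly with their number. That is at least consistent with the O(n) bound the paper claims, although the paper's own derivation is not visible here.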

Load-bearing premise

Expert-informed priors can be refined automatically from observed failure patterns without introducing bias or requiring manual tuning.

What would settle it

Controlled experiments on a fresh workload where induced failure cascades produce no measurable reduction in mean time to failure detection and no improvement in switchover timing compared with heartbeat baselines.
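
The comparison described above reduces to a relative reduction in mean detection time between the two arbiters. The sketch below shows that arithmetic only. The detection times are invented for illustration and are not the paper's data, and `relative_reduction` is a hypothetical helper name.

```python
from statistics import mean


def relative_reduction(baseline_times, candidate_times):
    """Fractional drop in mean detection time relative to the baseline."""
    baseline = mean(baseline_times)
    candidate = mean(candidate_times)
    return (baseline - candidate) / baseline


# Invented detection times in seconds for the same induced failure events.
heartbeat_detection = [10.0, 12.0, 8.0]
predictive_detection = [9.0, 11.0, 10.0]

reduction = relative_reduction(heartbeat_detection, predictive_detection)
# In this made-up run the predictive arbiter shows no gain at all,
# which is the kind of outcome that would count against the claim.
claim_survives = reduction > 0.0
```

A real evaluation would need many runs and a significance test rather than a bare comparison of means.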

Figures

Figures reproduced from arXiv: 2604.11989 by Anil Jangam, Ganesh Karthick Rajendran, Roy Kantharajah.

**Figure 3.** Component Breakdown: Detection/Prediction vs Exe

**Figure 4.** Probability Evolution Over Time for CSG Services.

**Figure 5.** Event Timeline: Switchover Comparison for Event 3

Original abstract

Geographically High-Available (Geo-HA) cluster systems are essential for service continuity in distributed cloud-native environments. However, traditional arbitration mechanisms, which are often predicated on deterministic node-level heartbeats, are resource-intensive and inherently reactive. This necessitates a dedicated arbiter per deployment and leads to reactive switchovers that incur unavoidable downtime, occurring only after a failure has already compromised the system. This paper presents a novel predictive arbitration framework that utilizes a shared, microservice-based architecture to consolidate arbitration logic across multiple Geo-HA domains, significantly reducing the aggregate infrastructure footprint. Central to our approach is an adaptive online learning mechanism grounded in a Bayesian Noisy-OR model that autonomously discovers and learns temporal cascade dependencies from emergent failure patterns. To overcome the "cold start" challenge, the system utilizes expert-informed priors that are dynamically refined at runtime without manual configuration. Experimental results demonstrate that this framework achieves a 60% reduction in Mean Time to Failure Detection (MTTFD) and improves total switchover efficiency by up to 77.8% compared to traditional reactive standards. By enabling a significant predictive lead time, the system allows switchovers to initiate proactively before hard failures occur, while maintaining a linear O(n) computational complexity. This approach provides a scalable, context-aware alternative that bridges the performance-durability gap in modern microservice architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper applies an online Bayesian Noisy-OR model with expert priors to predictive arbitration in Geo-HA clusters and reports 60% MTTFD cuts plus efficiency gains, but the evaluation details are too thin to assess the claims.

The main point is that this work takes the Noisy-OR Bayesian structure, already used in other domains, and puts it into a shared microservice arbiter for geographically distributed high-availability systems. The goal is to predict failure cascades ahead of time instead of waiting for heartbeats to fail, while folding in service criticality so the model does not treat every component the same. They handle the cold-start problem by starting with expert priors that get updated from live observations, and they claim the whole thing stays linear in complexity.

If the numbers hold, the 60% drop in mean time to failure detection and the 77.8% switchover improvement would matter for anyone running critical cloud services that cannot afford long outages. The shared-arbiter design also looks like a practical way to reduce the usual per-deployment overhead. That combination of architecture and online updating is the clearest new piece here, even if the underlying model is not original. The approach is coherent on its own terms and does not contain obvious internal contradictions or impossible assumptions.

The weakest part is the experimental support. The abstract states the quantitative gains but supplies no description of the test clusters, the exact reactive baselines, the number of failure scenarios, or any ablation that isolates the contribution of the temporal learning step. Without those, it is impossible to tell whether the reported lead time comes from genuine cascade discovery or from conditions that favor the model. The claim that the system autonomously learns dependencies without manual bias also needs concrete validation data that is not visible in the summary.

This paper is aimed at engineers and researchers who build or operate distributed HA systems rather than at the broader theory community. A reader already working on cloud reliability tools could extract usable ideas about shared arbiters and criticality-weighted prediction even if the exact numbers require checking. The work shows clear thinking about the practical constraints of Geo-HA setups and engages honestly with the reactive limitations of current methods. It deserves a serious referee. The core architecture and the stated performance targets are specific enough that reviewers can ask targeted questions about the experiments and the prior-refinement process rather than starting from scratch.

Referee Report

2 major / 3 minor

Summary. The paper proposes a predictive arbitration framework for geographically high-available (Geo-HA) cluster systems in cloud-native environments. It replaces traditional reactive, heartbeat-based arbitration with a shared microservice architecture and an adaptive Bayesian Noisy-OR model that learns temporal failure cascades from observed patterns. Expert-informed priors are refined online at runtime to address cold-start issues. The central empirical claims are a 60% reduction in Mean Time to Failure Detection (MTTFD), up to 77.8% improvement in switchover efficiency versus reactive baselines, proactive switchover lead time, and O(n) computational complexity.

Significance. If the performance claims are substantiated by rigorous experiments, the work could meaningfully advance high-availability mechanisms by shifting from reactive to predictive arbitration while lowering infrastructure overhead through sharing. The combination of Bayesian online learning with service-criticality awareness addresses a practical gap in microservice-based Geo-HA deployments; reproducible validation of the claimed gains and complexity bound would strengthen its potential impact on distributed systems design.

major comments (2)

[Experimental Evaluation] Experimental Evaluation section: The manuscript asserts a 60% MTTFD reduction and 77.8% switchover-efficiency gain, yet supplies no description of the experimental design, workload traces, baseline implementations (e.g., standard heartbeat or quorum-based arbiters), number of runs, or statistical tests. Without these elements the quantitative claims cannot be evaluated or reproduced.
[Bayesian Noisy-OR Model] Bayesian Noisy-OR Model section: The assertion that the model 'autonomously discovers and learns temporal cascade dependencies' from emergent patterns while dynamically refining expert priors without manual bias or circularity is load-bearing for the predictive lead-time claim, but the text provides neither the precise update rule for the priors nor any validation (e.g., held-out failure traces or ablation on prior strength) demonstrating that the learned structure generalizes beyond the training observations.

minor comments (3)

[Model Description] The title highlights 'Service Criticality Awareness' but the manuscript does not specify how criticality weights are encoded in the Noisy-OR conditional probability tables or how they affect the arbitration decision threshold.
[Complexity Analysis] The O(n) complexity claim is stated without an accompanying derivation or pseudocode showing the per-observation update cost; a brief complexity analysis paragraph would clarify the linear bound.
[Notation] Notation for the Noisy-OR parameters (e.g., leak probability, causal strengths) is introduced without a consolidated table; a single reference table would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have prepared revisions to strengthen the manuscript.

Referee: [Experimental Evaluation] Experimental Evaluation section: The manuscript asserts a 60% MTTFD reduction and 77.8% switchover-efficiency gain, yet supplies no description of the experimental design, workload traces, baseline implementations (e.g., standard heartbeat or quorum-based arbiters), number of runs, or statistical tests. Without these elements the quantitative claims cannot be evaluated or reproduced.

Authors: We agree that the Experimental Evaluation section lacks the necessary details for reproducibility and rigorous evaluation. In the revised manuscript we will expand this section to describe the experimental design, the workload traces used, the concrete baseline implementations (including heartbeat and quorum-based arbiters), the number of runs, and the statistical tests applied to the reported gains.
Revision: yes

Referee: [Bayesian Noisy-OR Model] Bayesian Noisy-OR Model section: The assertion that the model 'autonomously discovers and learns temporal cascade dependencies' from emergent patterns while dynamically refining expert priors without manual bias or circularity is load-bearing for the predictive lead-time claim, but the text provides neither the precise update rule for the priors nor any validation (e.g., held-out failure traces or ablation on prior strength) demonstrating that the learned structure generalizes beyond the training observations.

Authors: We acknowledge that the current text does not supply the precise prior-update rule or the requested validation experiments. In the revised manuscript we will add the exact mathematical update rule for online refinement of the expert priors and include results from held-out failure traces together with ablation studies on prior strength to demonstrate generalization.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The abstract and provided context describe a Bayesian Noisy-OR model that learns temporal dependencies from observed failure patterns using dynamically refined expert priors, with experimental results compared against traditional reactive standards. No equations, self-citations, or derivation steps are quoted that reduce any claimed prediction (such as the 60% MTTFD reduction) to a fitted input or prior by construction. The O(n) complexity and proactive switchover claims rest on the model's online learning capability rather than tautological re-use of evaluation data. This is the most common honest finding when no load-bearing self-referential step is exhibited.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the modeling assumption that failure cascades are well captured by a Noisy-OR structure and that expert priors can be refined online without manual intervention or overfitting.

free parameters (1)

expert-informed priors
Initial probability values supplied by domain experts to initialize the Bayesian model before runtime data arrives.
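
The paper's update rule is not given in the material summarized here. As one common way such expert priors can be refined online, the sketch below uses a conjugate Beta-Bernoulli update for a single causal strength. The class `BetaPrior`, the `strength` pseudo-count, and the simplification that only one cause is active per observation are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class BetaPrior:
    """Beta(alpha, beta) belief over one causal strength (hypothetical)."""
    alpha: float  # pseudo-count: cause active and a failure followed
    beta: float   # pseudo-count: cause active and no failure followed

    @classmethod
    def from_expert(cls, estimate, strength):
        # `strength` is how many virtual observations the expert's
        # estimate is worth before any runtime data arrives.
        return cls(alpha=estimate * strength,
                   beta=(1.0 - estimate) * strength)

    def update(self, failure_followed):
        # Conjugate update: each observation adds one count to one side.
        if failure_followed:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self):
        return self.alpha / (self.alpha + self.beta)


prior = BetaPrior.from_expert(estimate=0.5, strength=4.0)
for outcome in (True, True):
    prior.update(outcome)
refined = prior.mean
```

With several causes active at once, attributing a failure to a single cause is no longer this simple and typically needs an EM-style or approximate scheme, so this sketch should not be read as the authors' method.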

axioms (1)

Domain assumption: Temporal cascade dependencies among failures in Geo-HA clusters can be represented by a Bayesian Noisy-OR network.
This is the core modeling choice that enables autonomous discovery of failure patterns.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

[1] M. A. Georgiou et al., “Hihooi: A database replication middleware for scaling transactional databases consistently,” in 2022 IEEE Transactions on Knowledge and Data Engineering. IEEE, 2022. [Online]. Available: https://ieeexplore.ieee.org/document/9068420

[2] H. V. Pham, S. Qian, J. Wang, T. Lutellier, J. Rosenthal, L. Tan, Y. Yu, and N. Nagappan, “Problems and opportunities in training deep learning software systems: An analysis of variance,” in 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2020, pp. 771–783.

[3] SUSE. (n.d.) SUSE Linux Enterprise High Availability Extension 12 SP5: Geo clustering for SUSE Linux Enterprise High Availability Extension. Retrieved on [Insert Retrieval Date Here]. [Online]. Available: https://documentation.suse.com/sle-ha/12-SP5/html/SLE-HA-all/cha-ha-geo-concept.html

[4] Y. Liu et al., “A design of decentralized dual mode redundant hot standby arbitration switchover logic and architecture,” in 2018 IEEE 4th International Conference on Computer and Communications (ICCC). IEEE, 2018, pp. 1177–1181. [Online]. Available: https://ieeexplore.ieee.org/document/8401458

[5] Red Hat. (n.d.) Red Hat Gluster Storage 3.5 Administration Guide: Creating arbitrated replicated volumes. Retrieved on [Insert Retrieval Date Here]. [Online]. Available: https://docs.redhat.com/en/documentation/red hat gluster storage/3.5/html/administratio

[6] A. Sharma et al., “Automated data analytics and resource arbitration scheduling for containerized network functions,” in 2023 IEEE 10th International Conference on Data Science and Advanced Analytics (DSAA). IEEE, 2023, pp. 1–10. [Online]. Available: https://ieeexplore.ieee.org/document/10056679

[7] D. Ongaro and J. Ousterhout, “In search of an understandable consensus algorithm,” in 2014 USENIX Annual Technical Conference (USENIX ATC 14), Philadelphia, PA, 2014, pp. 305–319.

[8] L. Lamport, “The part-time parliament,” ACM Transactions on Computer Systems (TOCS), vol. 16, no. 2, pp. 133–169, 1998.

[9] T. Das et al., “Predictive failure analysis in cloud systems using machine learning,” IEEE Transactions on Reliability, vol. 67, no. 2, pp. 512–524, 2018.

[10] X. Zhou and X. Peng, “Detecting cascading failures in microservices,” in IEEE International Conference on Web Services, 2021.

[11] N. Friedman and M. Goldszmidt, “Online learning of Bayesian network parameters,” Machine Learning, vol. 50, pp. 95–126, 2003.

[12] W. Lam and F. Bacchus, “Online structure learning for Bayesian networks,” Artificial Intelligence, vol. 282, 2020.
