SWARM+: Scalable and Resilient Multi-Agent Consensus for Decentralized Data-Aware Workload Management
Distributed scientific workflows are increasingly executed across heterogeneous and geo-distributed computing environments, where centralized workload orchestration becomes a scalability and resilience bottleneck. This paper presents SWARM+, a decentralized workload management system that coordinates workload placement through hierarchical multi-agent consensus, reducing coordination overhead and improving scalability while tolerating failures and dynamic membership changes. SWARM+ enables data-aware scheduling policies that incorporate resource availability, data transfer node (DTN) connectivity, and data locality into workload placement decisions. We evaluate SWARM+ on the distributed FABRIC testbed using heterogeneous scientific workloads derived from production workflow traces obtained from the Pegasus Workflow Management System (WMS). Experimental results show that SWARM+ scales coordination to 990 distributed agents and achieves sub-second per-job selection time with 110 agents. SWARM+ also demonstrates balanced workload distribution, maintains over $95\%$ job completion under distributed failures, degrades gracefully during correlated site outages, and tolerates coordinator agent failures. Its data-aware policies improve schedule quality, and it reduces both selection time and scheduling latency by $97$--$98\%$ compared with the prior SWARM system.