SWARM+: Scalable and Resilient Multi-Agent Consensus for Decentralized Data-Aware Workload Management

Anirban Mandal; Ewa Deelman; Hamza Safri; Komal Thareja; Krishnan Raghavan


Distributed scientific workflows are increasingly executed across heterogeneous and geo-distributed computing environments, where centralized workload orchestration becomes a scalability and resilience bottleneck. This paper presents SWARM+, a decentralized workload management system that coordinates workload placement through hierarchical multi-agent consensus, reducing coordination overhead and improving scalability while tolerating failures and dynamic membership changes. SWARM+ enables data-aware scheduling policies that incorporate resource availability, data transfer node (DTN) connectivity, and data locality into workload placement decisions. We evaluate SWARM+ on the distributed FABRIC testbed using heterogeneous scientific workloads derived from production workflow traces obtained from the Pegasus Workflow Management System (WMS). Experimental results show that SWARM+ scales coordination to 990 distributed agents and achieves sub-second per-job selection time with 110 agents. SWARM+ also demonstrates balanced workload distribution, maintains over $95\%$ job completion under distributed failures with graceful degradation during correlated site outages, tolerates coordinator agent failures, and improves schedule quality by employing data-aware policies. Compared to the prior SWARM system, it reduces both selection time and scheduling latency by $97$--$98\%$.

