pith. sign in

arxiv: 1303.5891 · v1 · pith:2CWZM37Ynew · submitted 2013-03-23 · 💻 cs.DC

Fault Tolerance in Distributed Systems using Fused State Machines

classification 💻 cs.DC
keywords machinesfusionstatebackupreplicationrequiresdfsmsfault
0
0 comments X
read the original abstract

Replication is a standard technique for fault tolerance in distributed systems modeled as deterministic finite state machines (DFSMs or machines). To correct f crash or f/2 Byzantine faults among n different machines, replication requires nf additional backup machines. We present a solution called fusion that requires just f additional backup machines. First, we build a framework for fault tolerance in DFSMs based on the notion of Hamming distances. We introduce the concept of an (f,m)-fusion, which is a set of m backup machines that can correct f crash faults or f/2 Byzantine faults among a given set of machines. Second, we present an algorithm to generate an (f,f)-fusion for a given set of machines. We ensure that our backups are efficient in terms of the size of their state and event sets. Our evaluation of fusion on the widely used MCNC'91 benchmarks for DFSMs show that the average state space savings in fusion (over replication) is 38% (range 0-99%). To demonstrate the practical use of fusion, we describe its potential application to the MapReduce framework. Using a simple case study, we compare replication and fusion as applied to this framework. While a pure replication-based solution requires 1.8 million map tasks, our fusion-based solution requires only 1.4 million map tasks with minimal overhead during normal operation or recovery. Hence, fusion results in considerable savings in state space and other resources such as the power needed to run the backup tasks.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.