
arxiv: 1503.08877 · v1 · submitted 2015-03-31 · 💻 cs.DC

Falkirk Wheel: Rollback Recovery for Dataflow Systems

classification 💻 cs.DC
keywords: rollback, logical, events, different, effect, system, time, times

We present a new model for rollback recovery in distributed dataflow systems. We explain existing rollback schemes by assigning a logical time to each event such as a message delivery. If some processors fail during an execution, the system rolls back by selecting a set of logical times for each processor. The effect of events at times within the set is retained or restored from saved state, while the effect of other events is undone and re-executed. We show that, by adopting different logical time "domains" at different processors, an application can adopt appropriate checkpointing schemes for different parts of its computation. We illustrate with an example of an application that combines batch processing with low-latency streaming updates. We show rules, and an algorithm, to determine a globally consistent state for rollback in a system that uses multiple logical time domains. We also introduce selective rollback at a processor, which can selectively preserve the effect of events at some logical times and not others, independent of the original order of execution of those events. Selective rollback permits new checkpointing policies that are particularly well suited to iterative streaming algorithms. We report on an implementation of our new framework in the context of the Naiad system.
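
The idea of selective rollback described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm or the Naiad implementation: the `Event` type, the `rollback` function, integer logical times, and a running-sum state are all hypothetical simplifications, and the sketch assumes that event effects commute so that retained effects can be kept regardless of the original execution order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    time: int   # logical time (assumption: a plain integer, e.g. an epoch)
    delta: int  # effect of the event on a running-sum state (assumption)


def rollback(events, kept_times):
    """Split one processor's event log for a selective rollback.

    Events whose logical time is in `kept_times` keep their effect;
    all other events are undone and returned for re-execution.
    """
    retained = [e for e in events if e.time in kept_times]
    replay = [e for e in events if e.time not in kept_times]
    # State restored from the retained events only.
    state = sum(e.delta for e in retained)
    return state, replay


# Events were executed in this interleaved order.
log = [Event(1, 10), Event(2, 5), Event(1, 3), Event(3, 7)]

# Keep the effects at times 1 and 3, undo time 2, even though a
# time-2 event ran before some of the retained events.
state, replay = rollback(log, kept_times={1, 3})

# Re-executing the undone events brings the state back up to date.
recovered = state + sum(e.delta for e in replay)
```

Here the set of retained times need not be a prefix of the execution order, which is the property the abstract attributes to selective rollback. A real system would also need the globally consistent choice of time sets across processors that the paper describes.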
