Asynchronous Failure Detectors

Unknown author (2013-01-30)

This report is superseded by MIT-CSAIL-TR-2013-025.

Failure detectors -- oracles that provide information about process crashes -- are an important abstraction for crash tolerance in distributed systems. The generality of failure-detector theory, while providing great expressiveness, poses significant challenges in developing a robust hierarchy of failure detectors. We address some of these challenges by proposing (1) a variant of failure detectors called asynchronous failure detectors and (2) an associated modeling framework. Unlike the traditional failure-detector framework, our framework eschews real-time completely. We show that asynchronous failure detectors are sufficiently expressive to include several popular failure detectors including, but not limited to, the canonical Chandra-Toueg failure detectors, Sigma and other quorum failure detectors, Omega, anti-Omega, Omega^k, and Psi_k. Additionally, asynchronous failure detectors satisfy many desirable properties: they are self-implementable, guarantee that stronger asynchronous failure-detectors solve harder problems, and ensure that their outputs encode no information other than the set of crashed processes. We introduce the notion of a failure detector being representative for a problem to capture the idea that some problems encode the same information about process crashes as their weakest failure detectors do. We show that a large class of problems, called bounded problems, do not have representative failure detectors. Finally, we use the asynchronous failure-detector framework to show how sufficiently strong AFDs circumvent the impossibility of consensus in asynchronous systems.