ABSTRACT: In order to efficiently use future generations of supercomputers,
fault tolerance and power usage are two of the prime challenges anticipated by
the High Performance Computing (HPC) community. A significant share of faults
in HPC systems consists of hard failures, which in many cases lead to the
failure of one or more processes and eventually of the whole job.  
  In this talk, we will present our fault tolerance approach developed in the
scope of the SPPEXA-ESSEX project. We have developed a Checkpoint/Restart and
Automatic Fault Tolerance (CRAFT) library that serves two purposes. First, it
provides a framework that significantly reduces the effort needed for the
implementation of application-level checkpoint/restart methods in a program.
The user can extend the library with further user-defined data types, making
them ‘checkpointable’ for future use. Second, it provides a simpler
interface for dynamic process recovery, thus enabling applications to recover
automatically after process failures. For this purpose, we have used
User-Level Failure Mitigation (ULFM), which is a prototype implementation of
a fault-tolerant MPI. We have significantly reduced the complexity of the
failure detection and application recovery mechanisms. The two functionalities
of CRAFT can be used either separately or in combination. CRAFT library features,
optimizations and limitations will be discussed in detail.
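
  To illustrate what application-level checkpoint/restart means in practice,
the following is a minimal, generic sketch in Python. It does not use or
reproduce the CRAFT interface: the Checkpoint class, the run function and
their parameters (path, n_iter, cp_freq, fail_at) are hypothetical names
chosen for this example only. The loop periodically saves its own state and,
when started again after a failure, resumes from the last saved iteration
instead of from the beginning.

```python
import json
import os
import tempfile


class Checkpoint:
    """Hypothetical helper (not the CRAFT API): saves and restores a state dict."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)

    def write(self, state):
        # Write to a temporary file first, then rename it atomically, so that
        # a crash during the write never leaves a truncated checkpoint behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self.path)

    def read(self):
        with open(self.path) as f:
            return json.load(f)


def run(path, n_iter, cp_freq, fail_at=None):
    """Sum 0..n_iter-1, writing a checkpoint every cp_freq iterations.

    fail_at is an assumption of this sketch: it simulates a process failure
    at the given iteration so that the restart path can be exercised.
    """
    cp = Checkpoint(path)
    # Restart path: resume from the last checkpoint if one exists.
    state = cp.read() if cp.exists() else {"it": 0, "total": 0}
    while state["it"] < n_iter:
        if fail_at is not None and state["it"] == fail_at:
            raise RuntimeError("simulated process failure")
        state["total"] += state["it"]
        state["it"] += 1
        if state["it"] % cp_freq == 0:
            cp.write(state)
    return state["total"]
```

  In this sketch, a first call that fails partway through leaves the most
recent checkpoint on disk, and a second call with the same path continues
from it and returns the same result as an uninterrupted run. Automatic
recovery with ULFM, where surviving processes repair the communicator and
continue without a job restart, is not covered by this single-process example.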