ABSTRACT: Fault tolerance and power usage are two of the prime challenges that the High Performance Computing (HPC) community anticipates in using future generations of supercomputers efficiently. A significant share of faults in HPC systems are hard failures, which in many cases cause one or more processes, and eventually the whole job, to fail. In this talk, we present the fault tolerance approach developed within the SPPEXA-ESSEX project. We have developed a Checkpoint/Restart and Automatic Fault Tolerance (CRAFT) library that serves two purposes. First, it provides a framework that significantly reduces the effort needed to implement application-level checkpoint/restart methods in a program. The user can extend the library with additional user-specific data types, making them ‘checkpointable’ for future use. Second, it provides a simpler interface for dynamic process recovery, enabling applications to recover automatically after process failures. For this purpose, we use User-Level Failure Mitigation (ULFM), a prototype implementation of fault-tolerant MPI. We have significantly reduced the complexity of the failure detection and application recovery mechanisms. These two functionalities of CRAFT can be used separately or in combination. The features, optimizations, and limitations of the CRAFT library will be discussed in detail.