Written by: Lauren Perkins
September 15, 2021
Dr. Joshua Booth, Assistant Professor in the University of Alabama in Huntsville's Computer Science Department, was awarded his second National Science Foundation grant of 2021. The total intended award amount is just under $200,000.
Fault tolerance refers to the ability of a system to continue operating without interruption when one or more of its components fail. Booth's research, "Learning Fault Tolerance at Scale," aims to address the limitations of current fault-tolerance approaches by providing a theoretical foundation for a new class of fault-tolerant schemes. These schemes are suited to the broad array of applications based on iterative numerical simulations that evolve over time on discretized spatial domains. The research could help develop automated approaches for such applications, which are widely used across multiple industrial sectors, and could also increase the predictive power of climate and weather models to aid critical decision making.
In computer-aided design and analysis of engineered systems, such as automobiles or semiconductor chips, computational models are simulated on high-performance computers to characterize and evaluate key attributes. The sheer scale of such high-performance computing systems increases the likelihood of transient hardware faults from events such as cosmic radiation or processor-chip voltage fluctuations. Summit, for example, one of the world's fastest supercomputers, located at the US Department of Energy's Oak Ridge National Laboratory in Oak Ridge, TN, has over 20 billion transistors.
Traditional fault-tolerant schemes are either application-specific, requiring significant programmer effort to redesign or customize large-scale software, or application-agnostic, where all or most data are redundantly stored periodically to allow for recovery, which limits scalability because of significant memory and processing overheads.
The investigators will work closely with application and runtime system developers to seek broader use of this fault-tolerance framework, develop specialized undergraduate and graduate curricula for student training, and offer research experiences to high school students.