[Colloquium] Fault Resilience in Exascale Systems
October 8, 2010
- Date: Friday, October 8, 2010
- Time: 12:00 noon – 12:50 pm
- Place: Centennial Engineering Center, Room 1041
Rolf Riesen, Ph.D., Principal Member of Technical Staff, Scalable Computing Systems, Sandia National Laboratories
Exascale systems will have hundreds of thousands of compute nodes and millions of components, which increases the likelihood of faults. Today, applications use checkpoint/restart to recover from these faults, but even under ideal conditions, applications running on more than 50,000 nodes will spend more than half of their total run time saving checkpoints, restarting, and redoing work that was lost.
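To see why checkpoint/restart becomes untenable at this scale, here is a back-of-the-envelope sketch using Daly's well-known first-order model for the optimal checkpoint interval. The node MTBF, checkpoint write time, and restart cost below are illustrative assumptions, not figures from the talk.

```python
import math

def wasted_fraction(nodes, node_mtbf_h=25 * 8760, ckpt_h=0.5):
    """Rough fraction of run time lost to checkpoint/restart.

    Assumes independent node failures, so the system MTBF is the node
    MTBF divided by the node count, and uses Daly's first-order optimal
    checkpoint interval t_opt = sqrt(2 * delta * M) for checkpoint cost
    delta and system MTBF M. Restart cost is taken equal to the
    checkpoint write time. All parameters are illustrative assumptions.
    """
    system_mtbf = node_mtbf_h / nodes            # failures arrive faster at scale
    t_opt = math.sqrt(2 * ckpt_h * system_mtbf)  # Daly's optimal interval
    ckpt_overhead = ckpt_h / t_opt               # time spent writing checkpoints
    # Each failure costs a restart plus, on average, half an interval of rework.
    rework = (t_opt / 2 + ckpt_h) / system_mtbf
    return ckpt_overhead + rework

for n in (1_000, 10_000, 50_000, 100_000):
    print(f"{n:>7} nodes: ~{wasted_fraction(n):.0%} of run time wasted")
```

Under these assumed parameters the wasted fraction climbs from a few percent at 1,000 nodes to well over half beyond 50,000 nodes, which is the regime the abstract describes.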
Redundant computing is a method that allows an application to continue working even when failures occur. Instead of each failure interrupting the application, losing work and requiring restart time, multiple failures can be absorbed by the application until its redundancy is exhausted. In this talk I will present a method to analyze the benefits of redundant computing, present simulation results of its cost, and discuss a prototype MPI implementation.
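To make "absorbing failures until redundancy is exhausted" concrete, here is a small Monte Carlo sketch (an illustration, not the analysis or the MPI prototype from the talk): each rank is replicated r times on a fixed node budget, node lifetimes are independent and exponentially distributed, and the job is interrupted only when every replica of some rank has failed. The node count and MTBF are assumptions.

```python
import random

def time_to_interrupt(ranks, replicas, node_mtbf_h):
    """One sample of the time until some rank loses all of its replicas.

    Each node fails at an independent, exponentially distributed time;
    a rank survives until its last replica dies (max over replicas), and
    the job is interrupted when the first rank dies (min over ranks).
    """
    return min(
        max(random.expovariate(1.0 / node_mtbf_h) for _ in range(replicas))
        for _ in range(ranks)
    )

def mean_interrupt_time(ranks, replicas, trials=100, node_mtbf_h=25 * 8760):
    samples = (time_to_interrupt(ranks, replicas, node_mtbf_h)
               for _ in range(trials))
    return sum(samples) / trials

# Fixed budget of 12,000 nodes (an assumption), split among replicas.
for r in (1, 2, 3):
    hours = mean_interrupt_time(ranks=12_000 // r, replicas=r)
    print(f"{r}x redundancy: mean time between interrupts ~{hours:,.0f} h")
```

Even modest replication turns interrupts from an hours-scale event into a rare one, at the price of dedicating r times as many nodes per rank; quantifying that cost/benefit trade-off is what the talk's analysis and simulations address.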
Bio: Rolf Riesen grew up in Switzerland and learned electronics there. He became interested in software after getting tired of burning his fingers on a soldering iron, and earned a master's and a Ph.D. in computer science from the University of New Mexico (UNM). His advisor was Barney Maccabe, who now leads the Computer Science and Mathematics Division (CSM) at Oak Ridge National Laboratory.
In 1990 he started working with a group at Sandia while he was a research assistant at UNM, and after finishing his master's he was hired as a member of the technical staff in 1993. Throughout this time he designed, implemented, and debugged various pieces of system software, starting with SUNMOS on the nCUBE 2 and Puma on the Intel Paragon. He created his own cluster, Cplant, before large clusters were common, and was involved in the Puma successors: Jaguar, Cougar, and Catamount for the Intel ASCI Red machine and the Cray XT3 Red Storm.