[Colloquium] An Overview of HPC Resilience and an Approach to Soft Error Fault Injection
September 23, 2011
Watch Colloquium:
M4V file (682 MB)
- Date: Friday, September 23, 2011
- Time: 12:00 pm to 12:50 pm
- Place: Centennial Engineering Center 1041
Nathan DeBardeleben
Los Alamos National Laboratory
Over the next decade, the field of high-performance computing (supercomputing) will undoubtedly see major changes in the ways leadership-class machines are built, used, and maintained. The challenges are many: operating systems, programming models and languages, power, and file systems, to name but a few. This talk will focus on one of those challenges, the cross-cutting goal of providing reliable computation on fundamentally unreliable components. Nathan will provide an overview of the field of resilience, point to decadal obstacles, look at potential solutions that appear promising, and discuss areas that appear to need more emphasis. He will also present his own new research on a soft error fault injection (SEFI) framework, along with some early results. SEFI is intended as a framework for determining the resilience of a target application to soft errors. The talk will cover the initial implementation, which uses a processor emulator virtual machine, and the reasons SEFI might move away from that design toward a dynamic instrumentation approach.
Bio: Nathan DeBardeleben is a research scientist at Los Alamos National Laboratory, where he leads the HPC Resilience effort in the Ultrascale Systems Research Center (USRC). He joined LANL in 2004 after receiving his PhD, Master’s, and Bachelor’s degrees in computer engineering from Clemson University. At LANL, Nathan was an early developer and designer of the Eclipse Parallel Tools Platform (PTP) project, spent several years optimizing application codes, and has since turned his focus to resilient computation. Nathan is active in the resilience community and spent 2010 on an IPA assignment at the U.S. Department of Defense, where he led the Resilience Thrust of the Advanced Computing Systems Research Program. Active on several program committees, Nathan leads the Fault-Tolerance at Extreme Scale Workshop. His own research interests are in the field of reliable computation, particularly the area of HPC resilience. This includes, but is not limited to, fault-tolerance, resilient programming models, resilient application design, and soft errors (particularly those transient in nature).
(Students interested in Dr. DeBardeleben’s research who would like to meet with him over lunch should contact Dorian Arnold (darnold@cs.unm.edu).)