[Colloquium] On the Viability of Compression for Reducing the Overheads of Checkpoint/Restart-based Fault Tolerance
August 24, 2012
Watch Colloquium:
M4V file (330 MB)
- Date: Friday, August 24, 2012
- Time: 12:00 pm — 12:50 pm
- Place: Centennial Engineering Center 1041
Dewan Ibtesham
Department of Computer Science, University of New Mexico
The increasing size and complexity of high-performance computing (HPC) systems have led to major concerns over fault frequencies and the mechanisms necessary to tolerate these faults. Previous studies have shown that state-of-the-field checkpoint/restart mechanisms will not scale sufficiently for future-generation systems. Therefore, optimizations that reduce checkpoint overheads are necessary to keep checkpoint/restart mechanisms effective. In this work, we demonstrate that checkpoint data compression is a feasible mechanism for reducing checkpoint commit latencies and storage overheads. Leveraging a simple model for checkpoint compression viability, we show that: (1) checkpoint data compression is feasible for many types of scientific applications expected to run on extreme-scale systems; (2) checkpoint compression viability scales with checkpoint size; (3) the choice between user-level and system-level checkpoints has little impact on checkpoint compression viability; and (4) checkpoint compression viability scales with application process count. Lastly, we describe the impact that checkpoint compression might have on future-generation extreme-scale systems.
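The abstract does not spell out the viability model, so the following is only an illustrative sketch of one plausible form, not the speaker's actual formulation. It assumes that compression is worthwhile when the time to compress a checkpoint plus the time to write the smaller result is less than the time to write the uncompressed checkpoint. All function names, parameter names, and numbers are hypothetical.

```python
# Illustrative sketch only: an ASSUMED viability test, not the model from the talk.
# compression_factor is assumed to mean 1 - (compressed size / uncompressed size).


def commit_time(checkpoint_bytes, commit_bps):
    """Seconds to write an uncompressed checkpoint to stable storage."""
    return checkpoint_bytes / commit_bps


def commit_time_compressed(checkpoint_bytes, commit_bps, compress_bps,
                           compression_factor):
    """Seconds to compress a checkpoint and then write the compressed data."""
    compress_seconds = checkpoint_bytes / compress_bps
    write_seconds = checkpoint_bytes * (1.0 - compression_factor) / commit_bps
    return compress_seconds + write_seconds


def compression_is_viable(checkpoint_bytes, commit_bps, compress_bps,
                          compression_factor):
    """True if compressing first shortens the overall commit latency."""
    with_compression = commit_time_compressed(
        checkpoint_bytes, commit_bps, compress_bps, compression_factor)
    return with_compression < commit_time(checkpoint_bytes, commit_bps)


if __name__ == "__main__":
    # Hypothetical numbers: a 1 GB checkpoint and 100 MB/s of commit bandwidth.
    size, commit_bps = 1e9, 100e6
    for compress_bps in (500e6, 150e6):
        viable = compression_is_viable(size, commit_bps, compress_bps, 0.5)
        print(f"compressor at {compress_bps / 1e6:.0f} MB/s: viable={viable}")
```

Under these assumptions the inequality reduces to compression_factor × compress_bps > commit_bps, so the checkpoint size cancels out for a single process. Any dependence on checkpoint size or process count would have to come from how the compression factor and the shared commit bandwidth change with scale, which this sketch does not attempt to capture.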
Bio: Dewan Ibtesham is a third-year PhD student advised by Professor Dorian Arnold in the UNM Department of Computer Science. He received his bachelor's degree in Computer Science and Engineering from BUET (Bangladesh University of Engineering and Technology). After working for two and a half years in the software industry, he moved to the U.S. and started graduate school in fall 2009. His research interests are in high-performance computing and large-scale distributed systems, in particular making HPC systems fault tolerant and reliable so that users can realize the systems' full potential.