Esteban Meneses
PhD Student
emenese2 at illinois.edu
Profile

I am a Research Assistant Professor in the Center for Simulation and Modeling (SaM) at the University of Pittsburgh. My main role in SaM is to foster the development of accelerator-based scientific applications. To achieve this goal, we follow a strategy with an educational component and a research component: we train people to use accelerators, and we find ways to map scientific codes onto these architectures. This is my new webpage.

During my time in the PPL, I worked mostly on fault tolerance. I was involved in several projects related to resilience in HPC, ranging from efficient checkpoint/restart mechanisms to understanding how failures and energy interact. My thesis focused on scalable message-logging techniques. In particular, I developed a collection of strategies to reduce the memory overhead of the message log.

Research Areas
Papers
14-21
2014
[Paper]
Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance [Cluster 2014]
14-20
2014
[Paper]
Using Migratable Objects to Enhance Fault Tolerance Schemes in Supercomputers [IEEE Transactions on Parallel and Distributed Systems 2014]
14-02
2014
[Paper]
Energy Profile of Rollback-Recovery Strategies in High Performance Computing [ParCo 2014]
13-60
2013
[Paper]
Position Paper: Actionable Performance Modeling for Future Supercomputers [MODSIM 2013]
13-25
2013
[Paper]
A ‘Cool’ Way of Improving the Reliability of HPC Machines [SC 2013]
13-24
2013
[Paper]
ACR: Automatic Checkpoint/Restart for Soft and Hard Error Protection [SC 2013]
13-22
2013
[Paper]
Position Paper: A Multi-resolution Emulation + Simulation Methodology [MODSIM 2013]
13-17
2013
[PhD Thesis]
Scalable Message-Logging Techniques for Effective Fault Tolerance in HPC Applications [Thesis 2013]
12-37
2012
[Paper]
Assessing Energy Efficiency of Fault Tolerance Protocols for HPC Systems [SBAC-PAD 2012]
12-32
2012
[Paper]
Hiding Checkpoint Overhead in HPC Applications with a Semi-Blocking Algorithm [Cluster 2012]
12-14
2012
[Paper]
A Message-Logging Protocol for Multicore Systems [FTXS 2012]
12-04
2012
[Paper]
A Scalable Double In-memory Checkpoint and Restart Scheme towards Exascale [PPL Technical Report 2012]
11-30
2011
[Paper]
Design and Analysis of a Message Logging Protocol for Fault Tolerant Multicore Systems [PPL Technical Report 2011]
11-26
2011
[Paper]
Dynamic Load Balance for Optimized Message Logging in Fault Tolerant HPC Applications [Cluster 2011]
11-04
2011
[Paper]
Evaluation of Simple Causal Message Logging for Large-Scale Fault Tolerant HPC Systems [DPDNS 2011]
10-20
2010
[Paper]
Periodic Hierarchical Load Balancing for Large Supercomputers [IJHPCA 2010]
10-08
2010
[Paper]
Hierarchical Load Balancing for Charm++ Applications on Large Supercomputers [P2S2 2010]
10-02
2010
[Paper]
Team-based Message Logging: Preliminary Results [Resilience 2010]
Talks/Posters
14-31
2014
[Talk]
Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance [Cluster 2014]
13-50
2013
[Talk]
A ‘Cool’ Way of Improving the Reliability of HPC Machines [SC 2013]
12-43
2012
[Talk]
Assessing Energy Efficiency of Fault Tolerance Protocols for HPC Systems [SBAC-PAD 2012]
12-30
2012
[Talk]
A Message-Logging Protocol for Multicore Systems [FTXS 2012]
11-38
2011
[Talk]
Dynamic Load Balance for Optimized Message Logging in Fault Tolerant HPC Applications [Cluster 2011]
10-44
2010
[Talk]
Hierarchical Load Balancing for Charm++ Applications on Large Supercomputers [P2S2 2010]
10-37
2010
[Talk]
Clustering Parallel Applications to Enhance Message Logging Protocols [PPL Talk 2010]
10-29
2010
[Talk]
Team-based Message Logging: Preliminary Results [Resilience 2010]
09-27
2009
[Talk]
Adaptive Runtime Support for Fault Tolerance [PPL Talk 2009]