Research
No matter the scale or the complexity, people who are not computer systems experts should be able to use supercomputers and massively parallel systems in scalable, efficient, and reliable ways.
Keywords: High performance computing, Large scale distributed systems, Autonomous systems, Fault-tolerance, HPC Tools.
Current Research Projects
- Fault-tolerance
- Checkpoint/Restart Enhancements
- Simulating Resilience at Scale
- Resilience Benchmarks
- Reliable Data Aggregation
- Middleware, Tools and System Software