Date September 5, 2008
Speaker GENE COOPERMAN (Northeastern University)
Topic Disk Based Parallel Computation and Checkpoint Restart

This talk presents joint work from the speaker's High Performance Computing Laboratory. It highlights two loosely related topics. First, a vision for disk-based parallel computation is presented, based on our catch phrase "Disk is the New RAM". As the number of CPU cores grows, the RAM per CPU core tends to diminish on commodity computers. The solution is to use the many local disks of a computer cluster. Such a solution has been used to find a lower bound on solutions to Rubik's Cube, among other applications.

Fifty local disks have approximately the bandwidth of a _single_ RAM subsystem. Thus, the 50 local disks of a cluster have the potential to emulate a single 50-terabyte RAM subsystem. The obvious flaw in this argument is disk latency. The solution is a new run-time library with an abstraction for many data structures and access methods. The library hides the awkward low-level plumbing of the data parallelism. Appropriate language design principles (simpler language constructs are more efficient than complex constructs) then bias the end-user application toward good latency tolerance.
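The bandwidth and latency argument above can be made concrete with a small back-of-the-envelope sketch. Only the figure of 50 disks comes from the abstract. The per-disk bandwidth, capacity, and seek time below are illustrative assumptions chosen to be plausible for commodity disks, and the function names are hypothetical and not part of the speaker's library.

```python
# Back-of-the-envelope model of "Disk is the New RAM".
# NUM_DISKS is from the abstract; every other figure is an ASSUMPTION
# chosen for illustration, not a number given in the talk.
NUM_DISKS = 50
DISK_BANDWIDTH_MB_S = 100   # assumed streaming throughput of one disk
DISK_CAPACITY_TB = 1        # assumed capacity of one disk
DISK_SEEK_MS = 10           # assumed cost of one random disk access


def aggregate_bandwidth_gb_s(num_disks, mb_s_per_disk):
    """Combined streaming bandwidth of all disks, in GB/s."""
    return num_disks * mb_s_per_disk / 1000


def aggregate_capacity_tb(num_disks, tb_per_disk):
    """Combined capacity of all disks, in TB."""
    return num_disks * tb_per_disk


def random_access_seconds(num_accesses, seek_ms):
    """Time on one disk when every access pays a full seek."""
    return num_accesses * seek_ms / 1000


def streaming_seconds(num_bytes, mb_s):
    """Time on one disk to read the same data in one sequential pass."""
    return num_bytes / (mb_s * 1_000_000)


if __name__ == "__main__":
    print("aggregate bandwidth (GB/s):",
          aggregate_bandwidth_gb_s(NUM_DISKS, DISK_BANDWIDTH_MB_S))
    print("aggregate capacity (TB):",
          aggregate_capacity_tb(NUM_DISKS, DISK_CAPACITY_TB))
    # One million 8-byte reads: random access versus one streaming pass.
    print("random access (s):", random_access_seconds(1_000_000, DISK_SEEK_MS))
    print("streaming (s):", streaming_seconds(8_000_000, DISK_BANDWIDTH_MB_S))
```

Under these assumed figures, the disks match RAM only when data is streamed. Reads that each pay a seek are slower by several orders of magnitude, which is why a library that steers applications toward batched, sequential access matters.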

The second part of this talk then describes a mature user-space checkpoint-restart system, DMTCP, that transparently supports distributed, multi-threaded applications. Naturally, DMTCP is fully sufficient to checkpoint and restart our disk-based applications.



We thank MIT IS&T, CSAIL, and the Department of Mathematics for their generous support of this series.
