[R-sig-hpc] checkpointing with foreach

Fri Mar 22 22:24:40 CET 2013

Hi all,

For a number of years, I've used my own 'mpifarm' package, which is
built on top of Rmpi, for parallel computing on clusters.  Recently,
I've been moving toward 'foreach', with its elegant syntax,
flexibility, and neat separation of back- and front-ends.

A few years ago, when I moved my HPC onto the university's big
cluster, I discovered that I could no longer live without some kind of
checkpointing facility.  Not so much because my codes were crashing,
though they sometimes were, but because it became necessary to submit
codes with small walltime requests in order to get scheduling priority
for big jobs in a sea of small, easily-scheduled jobs.  So I started
queueing up a number of instances of the same job, each with a modest
walltime request and the ability to read its predecessor's checkpoint
file and pick up where it left off.  This is all within-R
checkpointing for embarrassingly parallel problems: basically the
master looks for a checkpoint .rda file, loads it if it finds it, and
adjusts its to-do list accordingly.  It then updates the checkpoint
file from time to time.  I need this functionality in my new
'foreach'-based codes, and can think of several ways of doing it, not
all of which are likely to work on the first try.  Before I go about
reinventing that wheel, I wonder if anyone out there has come up with
a robust, flexible solution to checkpointing that's built on top of
'foreach'.  Any other advice would be welcome, too.
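
To make the scheme described above concrete, here is a minimal sketch
of one way it might look on top of 'foreach'.  It is an illustration
under stated assumptions, not a tested or robust solution.  The
function name 'checkpointed_foreach' and its arguments ('tasks',
'fun', 'file', 'chunk') are hypothetical.  It assumes the tasks are
independent, that the checkpoint file always belongs to the same task
list, and that only one job writes to it at a time.  It uses %do% so
that it runs without a registered backend; %dopar% should be a drop-in
replacement once one is registered.

```r
library(foreach)

## Hypothetical helper: apply 'fun' to each element of 'tasks',
## saving a checkpoint to 'file' after every 'chunk' completed tasks.
checkpointed_foreach <- function (tasks, fun, file, chunk = 10) {
  results <- vector(mode = "list", length = length(tasks))
  done <- rep(FALSE, length(tasks))
  ## Pick up where a predecessor left off, if it left a checkpoint.
  ## load() restores 'results' and 'done' into this function's frame.
  if (file.exists(file)) load(file)
  todo <- which(!done)
  ## Work through the to-do list one chunk at a time.
  for (idx in split(todo, ceiling(seq_along(todo) / chunk))) {
    results[idx] <- foreach(i = idx) %do% fun(tasks[[i]])
    done[idx] <- TRUE
    ## Assumption: writing to a temporary file and then renaming it
    ## reduces the risk of a truncated checkpoint if the job is
    ## killed in the middle of a save.
    tmp <- paste0(file, ".tmp")
    save(results, done, file = tmp)
    file.rename(tmp, file)
  }
  results
}

## Usage: the first call does the work and writes the checkpoint.
ckpt <- tempfile(fileext = ".rda")
res <- checkpointed_foreach(1:25, function (x) x^2, file = ckpt)

## A later call finds every task already done and recomputes nothing.
res2 <- checkpointed_foreach(1:25, function (x) stop("not rerun"),
                             file = ckpt)
```

The chunk size trades checkpoint overhead against the amount of work
lost when a job hits its walltime limit: a job killed mid-chunk
repeats at most 'chunk' tasks in its successor.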

Aaron

-- 
Aaron A. King, Ph.D.
Ecology & Evolutionary Biology
Mathematics
Center for the Study of Complex Systems
University of Michigan
GPG Public Key: 0x15780975