[R-sig-hpc] RFC: Checkpoint-Restart for R/HPC (DMTCP)

Gene Cooperman gene at ccs.neu.edu
Thu Jan 21 00:59:07 CET 2016


Hello Everyone,

R can already save all objects in a workspace with save.image().  But
what if you are in the middle of a long-running computation in R and you
are worried about the computer crashing?  Wouldn't it be nice if that
computation could be restarted from a recent checkpoint and then run to
completion?

Our group has been developing DMTCP (Distributed MultiThreaded
Checkpointing) for more than a decade.  It is widely used, and the
current version is 2.4.3.  It supports checkpoint-restart of Linux
processes (such as an R session) while the computation is still running.

 DMTCP information is here:
    http://dmtcp.sourceforge.net

Building DMTCP is as easy as untar/configure/make.  Below is a simple
example of how to run R through the DMTCP wrapper:

   $ dmtcp_launch --interval 300 R
      # This starts an R session in which you run your computation
      # as usual.  Every 300 seconds (5 minutes), DMTCP saves:
      #    1) A checkpoint image file and
      #    2) A dmtcp_restart_script.sh in the current directory.
   *** CRASH! *** (Let's assume the computer crashes, and one then
   reboots.)

   # To restart the computation from the last checkpoint, run:
   $ ./dmtcp_restart_script.sh
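
The two commands above can be combined into one small wrapper, so that
the same script either resumes from the last checkpoint or starts a fresh
session.  This is only a sketch: the function name choose_command is
hypothetical, and it assumes that the wrapper is run from the directory in
which dmtcp_launch was started and that dmtcp_restart_script.sh is
executable there.

```shell
#!/bin/sh
# Hypothetical helper (not part of DMTCP): decide whether to resume from
# an existing checkpoint or to start a new R session under DMTCP.
# Assumption: the restart script lives in the current directory.

choose_command() {
    # Print the command that should be run, without running it.
    if [ -x ./dmtcp_restart_script.sh ]; then
        # A restart script exists, so resume from the last checkpoint.
        echo "./dmtcp_restart_script.sh"
    else
        # No checkpoint yet: start R with a checkpoint every 300 seconds.
        echo "dmtcp_launch --interval 300 R"
    fi
}

# To actually run the chosen command, uncomment the next line:
# eval "$(choose_command)"
```

Printing the command instead of running it makes it easy to check what
the wrapper would do before handing it a long computation.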

As the Bioconductor community is one of the largest and most diverse
groups of R users, we would like to find out whether people would find
these features helpful.  We would be more than glad to help the
R/Bioconductor community create a package that implements these
concepts.  We would also be happy to answer any questions you might
have.  If you would like more details on DMTCP, feel free to look through
the DMTCP FAQ ( http://dmtcp.sourceforge.net/FAQ.html ), or just ask your
questions here.

We also have a DMTCP forum, as well as other venues to provide
a friendly way to get further help from the DMTCP team:
  http://dmtcp.sourceforge.net/contactUs.html

We look forward to your comments.

Best wishes,
- Gene Cooperman


