[R-sig-hpc] RFC: Checkpoint-Restart for R/HPC (DMTCP)

Chirag Anand anand.chirag at gmail.com
Mon Jan 25 12:56:58 CET 2016


This would indeed be very useful, especially on cloud services. Cloud
VMs can crash because of a failure on the host system, losing the state
of the running program (here, the R computation). I think Google
Compute Engine supports live migration of VMs, though I am not sure
which technology it uses, but AWS does not.

On 21 January 2016 at 05:29, Gene Cooperman <gene at ccs.neu.edu> wrote:
> Hello Everyone,
>
> R currently lets you save all objects in a workspace through
> save.image().  But what if you are in the middle of a long-running
> computation in R and you are worried about the computer crashing?
> Wouldn't it be nice if that computation could restart from the point
> where it failed and continue to completion?
>
> Our group has been developing DMTCP (Distributed MultiThreaded
> Checkpointing) for more than a decade.  It is widely used and is
> currently at version 2.4.3.  It allows checkpoint-restart of Linux
> processes (such as an R session) while the computation is still
> running.
>
>  DMTCP information is here:
>     http://dmtcp.sourceforge.net
>
> Building DMTCP is as easy as untar/configure/make.  Below is a simple
> example of how to run R through the DMTCP wrapper:
>
>    $ dmtcp_launch --interval 300 R
>       # This starts an R session in which you carry on with the
>       # computation as usual.
>       # Every 300 seconds (5 minutes), DMTCP saves:
>       #    1) a checkpoint image file, and
>       #    2) a dmtcp_restart_script.sh in the current directory.
>    *** CRASH! *** (Let's assume the computer crashes and one then reboots.)
>
>    # To restart the computation from the last checkpoint, run:
>    $ ./dmtcp_restart_script.sh
>
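The launch-or-restart choice in the example above could be wrapped in a
small shell helper. This is only a minimal sketch, not something from
the DMTCP distribution: the function names choose_action and
run_r_checkpointed are hypothetical, and it assumes that dmtcp_launch is
on the PATH and that the restart script is written to the directory the
session was started from, as described above.

```shell
#!/bin/sh
# Hypothetical helper (not part of DMTCP): print "restart" if an
# executable dmtcp_restart_script.sh exists in the given directory,
# otherwise print "launch".
choose_action() {
    if [ -x "$1/dmtcp_restart_script.sh" ]; then
        echo restart
    else
        echo launch
    fi
}

# Hypothetical wrapper: run R under DMTCP in directory $1 (default: .),
# checkpointing every $2 seconds (default: 300). Resumes from the last
# checkpoint when a restart script is present. The body runs in a
# subshell so the caller's working directory is left unchanged.
run_r_checkpointed() (
    dir=${1:-.}
    interval=${2:-300}
    cd "$dir" || exit 1
    if [ "$(choose_action .)" = restart ]; then
        ./dmtcp_restart_script.sh
    else
        dmtcp_launch --interval "$interval" R
    fi
)
```

After sourcing the file, calling "run_r_checkpointed . 300" both before
and after a crash would start a fresh session the first time and resume
from the last checkpoint afterwards.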
> As the BioConductor community is one of the largest and most diverse
> groups of R users, we would like to find out whether people would find
> these features helpful.  We would be more than glad to help the
> R/BioConductor community create a package that implements these
> concepts.  We would also be happy to answer any questions you might
> have.  If you would like
> more details on DMTCP, feel free to look through the questions/answers
> in the DMTCP FAQ ( http://dmtcp.sourceforge.net/FAQ.html ) or you can
> just ask your questions here.
>
> We also have a DMTCP forum, as well as other venues to provide
> a friendly way to get further help from the DMTCP team:
>   http://dmtcp.sourceforge.net/contactUs.html
>
> We look forward to your comments.
>
> Best wishes,
> - Gene Cooperman



-- 
Chirag Anand
http://atvariance.in/chiraganand


