[R-sig-hpc] RFC: Checkpoint-Restart for R/HPC (DMTCP)

Tue Jan 26 02:03:52 CET 2016

Hi Chirag,

    This should work.  In my case, I would probably try running
a job on a cloud as follows:

    [ copy DMTCP executables to job submission directory ]
    path_to_dmtcp_root/bin/dmtcp_launch -i 30 Rscript myscript.R

The "-i 30" flag sets the checkpoint interval to 30 seconds.  So, every
30 seconds, we get a new version of the following files:

    ckpt_myscript.R_*.dmtcp
    dmtcp_restart_script_*.sh
    dmtcp_restart_script.sh  (symbolic link to dmtcp_restart_script_*.sh)

If a job crashes, copy the above files to a new directory and
submit a new cloud job:

    [ copy DMTCP executables to job submission directory ]
    ./dmtcp_restart_script.sh -i 30

The script should automatically link to the file ckpt_myscript.R_*.dmtcp .
An alternative approach would be:

    path_to_dmtcp_root/bin/dmtcp_restart -i 30 ckpt_myscript.R_*.dmtcp
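
A small wrapper could choose between the two cases (fresh launch or
restart), so that the same job script can be resubmitted unchanged after
a crash.  The following is only a minimal sketch.  The function name
dmtcp_job_command, the variables DMTCP_ROOT, CKPT_INTERVAL and RUN, and
the dry-run behavior are illustrative assumptions of mine, not part of
DMTCP.  It also assumes that the restart script, if there is one, has
already been copied into the job directory as described above.

```shell
#!/bin/sh
# Sketch only: pick the DMTCP command for a first or a resubmitted job.
# DMTCP_ROOT, CKPT_INTERVAL and RUN are hypothetical names for this example.
DMTCP_ROOT="${DMTCP_ROOT:-path_to_dmtcp_root}"
CKPT_INTERVAL="${CKPT_INTERVAL:-30}"

# Print the command line that should be run for the job directory $1.
dmtcp_job_command() {
    dir="${1:-.}"
    if [ -x "$dir/dmtcp_restart_script.sh" ]; then
        # Checkpoint files are present: resume from the saved checkpoint.
        echo "./dmtcp_restart_script.sh -i $CKPT_INTERVAL"
    else
        # No checkpoint yet: start R under DMTCP with periodic checkpoints.
        echo "$DMTCP_ROOT/bin/dmtcp_launch -i $CKPT_INTERVAL Rscript myscript.R"
    fi
}

# Dry run by default: only show the chosen command.
cmd=$(dmtcp_job_command .)
echo "$cmd"

# Set RUN=1 to actually execute the chosen command in the current directory.
if [ "${RUN:-0}" = "1" ]; then
    sh -c "$cmd"
fi
```

Run without RUN=1, the script only prints the command it would use, which
makes it easy to check the logic before submitting a real job.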

Please don't hesitate to ask if I can help further.

Best,
- Gene

On Mon, Jan 25, 2016 at 05:26:58PM +0530, Chirag Anand wrote:
> This can indeed be very useful, especially while using one of the
> cloud services. Cloud VMs often crash because of an error on the main
> system, thereby losing the state of the program (R computations). I think
> Google Compute Engine supports live migration of VMs, though I am not sure
> which technology they are using, but AWS does not.
> 
...
> 
> -- 
> Chirag Anand
> http://atvariance.in/chiraganand