[R-sig-hpc] RFC: Checkpoint-Restart for R/HPC (DMTCP)
Gene Cooperman
gene at ccs.neu.edu
Tue Jan 26 02:03:52 CET 2016
Hi Chirag,
This should work. In my case, I would probably try running
a job on a cloud as follows:
[ copy DMTCP executables to job submission directory ]
path_to_dmtcp_root/bin/dmtcp_launch -i 30 Rscript myscript.R
This creates a checkpoint every 30 seconds (the -i flag sets the
checkpoint interval in seconds). So, every 30 seconds, we get a new
version of the following files:
ckpt_myscript.R_*.dmtcp
dmtcp_restart_script_*.sh
dmtcp_restart_script.sh (symbolic link to dmtcp_restart_script_*.sh)
If a job crashes, one copies the above files to a new directory, and
submits a new Cloud job:
[ copy DMTCP executables to job submission directory ]
./dmtcp_restart_script.sh -i 30
The script should automatically link to the file ckpt_myscript.R_*.dmtcp .
An alternative approach would be:
path_to_dmtcp_root/bin/dmtcp_restart -i 30 ckpt_myscript.R_*.dmtcp
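The two cases above (fresh launch versus restart) could be folded into one
job script. What follows is only a rough sketch, not a tested recipe: the
variable DMTCP_ROOT and the function build_dmtcp_command are names made up
for this example, and the function prints the command instead of running it,
so it can be inspected before being used in a real Cloud job.

```shell
#!/bin/sh
# Sketch: pick a fresh launch or a restart, depending on whether
# checkpoint images are present in the job directory.
# DMTCP_ROOT and build_dmtcp_command are hypothetical names for this example.

DMTCP_ROOT="${DMTCP_ROOT:-path_to_dmtcp_root}"

# Print the command a job would run for directory $1 with interval $2 (seconds).
build_dmtcp_command() {
    job_dir="$1"
    interval="$2"
    # Expand the glob; if nothing matches, the literal pattern is left in $1.
    set -- "$job_dir"/ckpt_myscript.R_*.dmtcp
    if [ -e "$1" ]; then
        # Checkpoint images found: restart from them.
        echo "$DMTCP_ROOT/bin/dmtcp_restart -i $interval $*"
    else
        # No checkpoint yet: start the R script under DMTCP.
        echo "$DMTCP_ROOT/bin/dmtcp_launch -i $interval Rscript myscript.R"
    fi
}

# Dry run: show the command that would be used in the current directory.
build_dmtcp_command . 30
```

Replacing the echo with eval (or calling the binaries directly) would turn
the dry run into a real launch or restart.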
Please don't hesitate to ask if I can help further.
Best,
- Gene
On Mon, Jan 25, 2016 at 05:26:58PM +0530, Chirag Anand wrote:
> This can indeed be very useful, especially while using one of the
> cloud services. Cloud VMs often crash because of an error on the main
> system, thereby losing the state of the program (R computations). I think
> Google Compute Engine supports live migration of VMs, though I am not sure
> which technology they are using, but AWS does not.
>
...
>
> --
> Chirag Anand
> http://atvariance.in/chiraganand