[R] checkpointing
Andy Jacobson
andy.jacobson using noaa.gov
Wed Dec 15 02:59:48 CET 2021
I have been using DMTCP successfully for a long-running optim() task. It is a single-core process running on a large Linux cluster with Slurm as the job manager. The cluster places an 8-hour limit on individual jobs, and since my cost function takes 11 minutes to compute, I need many such jobs run sequentially. To make DMTCP work, I had to rework file I/O to avoid references to temporary files written to /tmp. Other than that, it just works: optim() is checkpointed just before the 8 hours are up and then resumed successfully in a subsequent batch job running on a different core of the cluster.
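To give a feel for "many such jobs", here is a rough back-of-the-envelope calculation in R. The 8-hour limit and the 11-minute evaluation time are the figures quoted above; the checkpoint margin and the total number of evaluations are made-up assumptions for illustration only.

```r
job_limit_min <- 8 * 60   # 8-hour Slurm limit per job (from the post)
eval_min      <- 11       # minutes per cost-function evaluation (from the post)
margin_min    <- 10       # ASSUMPTION: minutes reserved for writing the checkpoint
n_evals       <- 500      # ASSUMPTION: total evaluations optim() ends up needing

# Complete evaluations that fit in one job if nothing is reserved
full_evals_per_job <- floor(job_limit_min / eval_min)

# A system-level checkpoint can interrupt an evaluation part-way without
# losing the work, so fractional evaluations per job count here
evals_per_job <- (job_limit_min - margin_min) / eval_min
n_jobs <- ceiling(n_evals / evals_per_job)

cat(full_evals_per_job, "full evaluations per job;",
    n_jobs, "jobs for", n_evals, "evaluations\n")
```

With a different margin or evaluation count the job count changes accordingly; the point is only that a few dozen evaluations fit in each 8-hour slot.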
While I have an answer for my particular task, it would still be useful to checkpoint using the scheme Henrik suggests. Thanks all for the interesting conversation!
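Until something like that exists, a restricted form is possible in plain R, along the lines of Duncan's suggestion further down the thread (keep the state in R variables). What follows is only a minimal sketch, not what I actually ran: it saves every cost-function value to disk, and a later job replays the saved values instantly before continuing with new evaluations. The helper name cached_cost() and the cache file name are hypothetical, and the Rosenbrock function stands in for the real 11-minute cost function. It assumes the cost function and the optimizer are deterministic (true for the default Nelder-Mead method), so the replay follows the same path as an uninterrupted run.

```r
# Hypothetical helper: wrap a cost function so each evaluation is saved to disk.
cached_cost <- function(fn, cache_file) {
  # Load evaluations saved by an earlier job, if there are any
  cache <- if (file.exists(cache_file)) readRDS(cache_file) else list()
  function(par) {
    # Key each evaluation by its parameters, printed at full double precision
    key <- paste(sprintf("%.17g", par), collapse = ",")
    if (is.null(cache[[key]])) {
      cache[[key]] <<- fn(par)       # the expensive step, run once per point
      # Write under a temporary name and then rename, so a job killed
      # mid-write does not leave a truncated cache behind
      tmp <- paste0(cache_file, ".tmp")
      saveRDS(cache, tmp)
      file.rename(tmp, cache_file)
    }
    cache[[key]]
  }
}

# Stand-in for the real cost function
rosenbrock <- function(x) 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2

# ASSUMPTION: the cache lives on storage that persists between jobs, not /tmp
fit <- optim(c(-1.2, 1), cached_cost(rosenbrock, "optim_cache.rds"))
```

Each batch job runs the same script. The cached evaluations cost essentially nothing compared with the real ones, so a resumed job catches up almost immediately. Unlike DMTCP, this only protects the cost-function results, not the rest of the R session.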
-Andy
On 12/14/21 5:39 PM, Henrik Bengtsson wrote:
> On Tue, Dec 14, 2021 at 1:17 AM Andy Jacobson <andy using yovo.org> wrote:
>>
>> Those are good points, Duncan. I am experimenting with a nice checkpointing tool called DMTCP. It operates on the system level but is quite OS-dependent. It can be found at http://dmtcp.sourceforge.net/index.html.
>>
>> Still, it would be nice to be able to checkpoint calls within R to potentially long-running processes like optim().
>
> Teasing idea. Imagine if we could come up with some de-facto standard
> API for this and that such a framework could be called automatically
> by R. Something similar to how user interrupts are checked (e.g.
> R_CheckUserInterrupt()) on a regular basis by the R engine and
> throughout the R code. That could help with troubleshooting and debugging,
> e.g. sending the checkpoint to someone else or going backwards in
> time.
>
> Pasting in the below since I failed to hit Reply *All* the other day,
> and it was only Richard who got it:
>
> A few weeks ago, I played around with DMTCP (Distributed MultiThreaded
> CheckPointing) for Linux (https://github.com/dmtcp/dmtcp). I'm
> sharing in case someone is interested in investigating this further.
> Also, somewhere on the DMTCP wiki, they asked for testing with R by
> more experienced users.
>
> "DMTCP is a tool to transparently checkpoint the state of multiple
> simultaneous applications, including multi-threaded and distributed
> applications. It operates directly on the user binary executable,
> without any Linux kernel modules or other kernel modifications."
>
> They seem to be able to run this with HPC jobs, open files, Linux
> containers, and even MPI, and so on. I've only tested it very quickly
> with interactive R and it seems to work. Obviously more testing needs
> to be done to identify when it doesn't work. For example, I'd have a
> hard time believing it would work out of the box with local parallel
> PSOCK workers. They mention "plug-ins", so maybe there's a way to add
> support for specific use cases one by one.
>
> Various academic HPC environments appear to use it, e.g.
>
> * https://docs.nersc.gov/development/checkpoint-restart/dmtcp/
> * http://wiki.orc.gmu.edu/mkdocs/Creating_Checkpoints_%28DMTCP%29/
> * https://wiki.york.ac.uk/display/RCS/VK21%29+Checkpointing+with+DMTCP
>
> That's all I have time for now,
>
> Henrik
>
>>
>> -Andy
>>
>> On 12/13/21 11:51 AM, Duncan Murdoch wrote:
>>> On 13/12/2021 12:58 p.m., Greg Minshall wrote:
>>>> Jeff,
>>>>
>>>>> This sounds like an OS feature, not an R feature... certainly not a
>>>>> portable R feature.
>>>>
>>>> i'm not arguing for it, but this seems to me like something that could
>>>> be a language feature.
>>>>
>>>
>>> R functions can call libraries written in other languages, and can start processes, etc. R doesn't know everything going on in every function call, and would have a lot of trouble saving it.
>>>
>>> If you added some limitations, e.g. a process that periodically has its entire state stored in R variables, then it would be a lot easier.
>>>
>>> Duncan Murdoch
>>
>> --
>> Andy Jacobson
>> andy using yovo.org
>>
--
Andy Jacobson
andy.jacobson using noaa.gov
NOAA Global Monitoring Lab
325 Broadway
Boulder, Colorado 80305
303/497-4916