[Rd] Reduce memory peak when serializing to raw vectors

Martinez de Salinas, Jorge jorge.martinez-de-salinas at hp.com
Wed Mar 18 06:53:44 CET 2015


Thanks Simon, Michael. 
Looking at the design more carefully I think we can get away with serializing directly to sockets or to a file in /dev/shm if we want to keep things in memory.

-Jorge
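
A minimal sketch of the /dev/shm route, assuming Linux with tmpfs mounted at /dev/shm and reusing the df from the example quoted further down; the file name is made up:

        path <- "/dev/shm/df.bin"
        con <- file(path, "wb")       # tmpfs-backed, so the bytes stay in RAM
        serialize(df, con)            # streams to the connection; no raw vector is built
        close(con)

        con <- file(path, "rb")
        df2 <- unserialize(con)       # read it back the same way
        close(con)
        unlink(path)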

From: Simon Urbanek [mailto:simon.urbanek at r-project.org] 
Sent: Tuesday, March 17, 2015 3:13 PM
To: Michael Lawrence
Cc: Martinez de Salinas, Jorge; r-devel at r-project.org
Subject: Re: [Rd] Reduce memory peak when serializing to raw vectors

In principle, yes (that's what Rserve serialization does), but AFAIR we don't have the infrastructure in place for that. But then you may as well serialize to a connection instead. To be honest, I don't see why you would serialize anything big to a vector - you can't really do anything useful with it that you couldn't do with the streaming version.
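
A sketch of that point, using a temporary file as the connection (a socket or pipe connection works the same way); everything the raw-vector form supports, the streamed form supports too:

        obj <- list(x = rnorm(1e3), y = letters)
        tmp <- tempfile()
        con <- file(tmp, "wb")
        serialize(obj, con)                 # streamed out in chunks; no raw vector in R
        close(con)

        con <- file(tmp, "rb")
        identical(unserialize(con), obj)    # TRUE - the stream round-trips just like the vector
        close(con)
        unlink(tmp)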

On Mar 17, 2015, at 17:48, Michael Lawrence <lawrence.michael at gene.com> wrote:
Presumably one could stream over the data twice, the first pass just to get the size, without storing the data. Slower but more memory efficient, unless I'm missing something.
Michael
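
At the R level, a rough stand-in for this two-pass idea is to stream to a scratch file first: once the file exists its size is known, and readBin() can then fill a raw vector of exactly that size in one allocation (the file handling here is illustrative, and it trades the extra memory for disk or tmpfs I/O):

        tmp <- tempfile()                  # or a path under /dev/shm to stay in RAM
        con <- file(tmp, "wb")
        serialize(df, con)                 # "pass 1": stream out; the size is still unknown
        close(con)

        n   <- file.info(tmp)$size         # now the size is known
        ser <- readBin(tmp, "raw", n = n)  # "pass 2": one allocation of exactly n bytes
        unlink(tmp)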

On Tue, Mar 17, 2015 at 2:03 PM, Simon Urbanek <simon.urbanek at r-project.org> wrote:
Jorge,

What you propose is not possible because the size of the output is unknown; that's why a dynamically growing PStream buffer is used - it cannot be pre-allocated.

Cheers,
Simon
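
A quick illustration that the size cannot be pre-computed: it generally differs from the in-memory footprint and is only known once the serializer has walked the object (the toy object is arbitrary):

        x <- data.frame(a = runif(10), b = sample(letters, 10))
        object.size(x)               # in-memory footprint
        length(serialize(x, NULL))   # bytes actually produced - a different number,
                                     # known only after serialization has run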


> On Mar 17, 2015, at 1:37 PM, Martinez de Salinas, Jorge <jorge.martinez-de-salinas at hp.com> wrote:
>
> Hi,
>
> I've been doing some tests using serialize() to a raw vector:
>
>       df <- data.frame(runif(50e6,1,10))
>       ser <- serialize(df,NULL)
>
> In this example the data frame and the serialized raw vector occupy ~400MB each (50e6 doubles x 8 bytes), for a total of ~800MB. However, the memory peak during serialize() is ~1.2GB:
>
>       $ cat /proc/15155/status |grep Vm
>       ...
>       VmHWM:   1207792 kB
>       VmRSS:    817272 kB
>
> We work with very large data frames, and in many cases this peak kills R with an "out of memory" error.
>
> This is the relevant code in R 3.1.3, at src/main/serialize.c:2494:
>
>       InitMemOutPStream(&out, &mbs, type, version, hook, fun);
>       R_Serialize(object, &out);
>       val =  CloseMemOutPStream(&out);
>
> The serialized object is stored in a buffer pointed to by out.data. Then, in CloseMemOutPStream(), R copies the whole buffer into a newly allocated SEXP (the raw vector that holds the final result):
>
>       PROTECT(val = allocVector(RAWSXP, mb->count));
>       memcpy(RAW(val), mb->buf, mb->count);
>       free_mem_buffer(mb);
>       UNPROTECT(1);
>
> Before free_mem_buffer() is called, the process is using ~1.2GB (the original data frame + the serialization buffer + the final serialized raw vector).
>
> One possible solution would be to allocate the final raw vector up front and have the serializer write directly into its buffer, avoiding the intermediate copy. This would bring the memory peak down from ~1.2GB to ~800MB.
>
> Thanks,
> -Jorge
>

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


