[Rd] Reduce memory peak when serializing to raw vectors

Michael Lawrence lawrence.michael at gene.com
Tue Mar 17 22:48:00 CET 2015


Presumably one could stream over the data twice: the first pass only
computes the size, without storing anything, and the second writes into a
raw vector pre-allocated at exactly that size. Slower, but more memory
efficient, unless I'm missing something.
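
A rough sketch of the idea, as standalone C rather than a patch against
serialize.c. Every name here (sink, toy_serialize, serialize_two_pass) is
made up for illustration and is not R's API; toy_serialize merely stands in
for R_Serialize, and malloc stands in for allocating the final raw vector.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical output stream: a write callback plus its state.
   This is NOT R's R_outpstream_t, just the same general shape. */
typedef struct sink {
    void (*write)(struct sink *s, const void *src, size_t n);
    unsigned char *buf;  /* NULL during the counting pass */
    size_t cap;          /* capacity of buf (unused when counting) */
    size_t count;        /* bytes seen so far */
} sink;

/* Pass 1 sink: counts bytes and stores nothing. */
static void count_write(sink *s, const void *src, size_t n) {
    (void)src;
    s->count += n;
}

/* Pass 2 sink: copies into a buffer pre-allocated at the exact size. */
static void buffer_write(sink *s, const void *src, size_t n) {
    if (n == 0) return;
    assert(s->count + n <= s->cap);  /* both passes must produce the same bytes */
    memcpy(s->buf + s->count, src, n);
    s->count += n;
}

/* Toy serializer (assumption, stands in for R_Serialize):
   a length header followed by the raw payload. */
static void toy_serialize(const double *x, size_t len, sink *s) {
    s->write(s, &len, sizeof len);
    s->write(s, x, len * sizeof *x);
}

/* Two-pass serialization. Returns a malloc'd buffer holding exactly
   *out_size bytes, or NULL if the allocation fails. */
static unsigned char *serialize_two_pass(const double *x, size_t len,
                                         size_t *out_size) {
    /* Pass 1: measure only. */
    sink counter = { count_write, NULL, 0, 0 };
    toy_serialize(x, len, &counter);

    /* Allocate the final result once, at its exact size. */
    unsigned char *buf = malloc(counter.count);
    if (buf == NULL) return NULL;

    /* Pass 2: write straight into the final buffer, no intermediate copy. */
    sink writer = { buffer_write, buf, counter.count, 0 };
    toy_serialize(x, len, &writer);
    assert(writer.count == counter.count);

    *out_size = writer.count;
    return buf;
}
```

With this shape the peak is the input plus one copy of the output, at the
cost of walking the object twice. It assumes the object does not change
between the two passes; anything with side effects during serialization
(a hook function, for example) would presumably run twice.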

Michael

On Tue, Mar 17, 2015 at 2:03 PM, Simon Urbanek <simon.urbanek at r-project.org>
wrote:

> Jorge,
>
> What you propose is not possible because the size of the output is not
> known in advance. That is why a dynamically growing PStream buffer is
> used: the output cannot be pre-allocated.
>
> Cheers,
> Simon
>
>
> > On Mar 17, 2015, at 1:37 PM, Martinez de Salinas, Jorge <jorge.martinez-de-salinas at hp.com> wrote:
> >
> > Hi,
> >
> > I've been doing some tests using serialize() to a raw vector:
> >
> >       df <- data.frame(runif(50e6,1,10))
> >       ser <- serialize(df,NULL)
> >
> > In this example the data frame and the serialized raw vector occupy
> > ~400MB each, for a total of ~800MB. However, the memory peak during
> > serialize() is ~1.2GB:
> >
> >       $ cat /proc/15155/status |grep Vm
> >       ...
> >       VmHWM:   1207792 kB
> >       VmRSS:    817272 kB
> >
> > We work with very large data frames and in many cases this is killing R
> > with an "out of memory" error.
> >
> > This is the relevant code in R 3.1.3 in src/main/serialize.c:2494
> >
> >       InitMemOutPStream(&out, &mbs, type, version, hook, fun);
> >       R_Serialize(object, &out);
> >       val =  CloseMemOutPStream(&out);
> >
> > The serialized object is stored in a buffer pointed to by out.data.
> > Then, in CloseMemOutPStream(), R copies the whole buffer to a newly
> > allocated SEXP object (the raw vector that stores the final result):
> >
> >       PROTECT(val = allocVector(RAWSXP, mb->count));
> >       memcpy(RAW(val), mb->buf, mb->count);
> >       free_mem_buffer(mb);
> >       UNPROTECT(1);
> >
> > Before calling free_mem_buffer() the process is using ~1.2GB (the
> > original data frame + the serialization buffer + final serialized raw
> > vector).
> >
> > One possible solution would be to allocate a buffer for the final raw
> > vector and store the serialization result directly into that buffer. This
> > would bring the memory peak down from ~1.2GB to ~800MB.
> >
> > Thanks,
> > -Jorge
> >
> > ______________________________________________
> > R-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
