[Rd] as.data.frame requires a lot of memory (PR#14140)

Simon Urbanek simon.urbanek at r-project.org
Mon Dec 14 23:01:19 CET 2009


On Dec 14, 2009, at 12:45 , rfalke at tzi.de wrote:

> Full_Name: Raimar Falke
> Version: R version 2.10.0 (2009-10-26)
> OS: Linux 2.6.27-16-generic #1 SMP Tue Dec 1 19:26:23 UTC 2009  
> x86_64 GNU/Linux
> Submission from: (NULL) (134.102.222.56)
>
>
> The construction of a data frame in the way shown below requires
> much more memory than expected. If we assume a cell value takes 8
> bytes, the total amount of data is 128MB. However, the process takes
> about 920MB and not the expected 256MB (two times the data set).
>
> With the real data sets (~35000 observations with ~33000 attributes)
> the conversion to a data frame has to be killed at 60GB of memory
> usage, while it should only require 17.6GB (2*8.8GB).
>
>  dfn <- rep(list(rep(0, 4096)), 4096)
>  test <- as.data.frame.list(dfn)
>
> I also tried the incremental construction of the
> data-frame: df$colN <- dataForColN. While I currently can't say much
> about the memory usage, it takes a looong time.
>
> After the construction the saved-and-loaded data-frame has the  
> expected size.
>
> What is the recommended way to construct larger data-frames?
>

Please use R-help for questions, and not the bug tracking system!



There are a few issues with your example - mainly because it has no row
names and no column names, so R will try to create them from the
expression, which is inherently slow and memory-consuming. So first,
make sure you set the names on the list, e.g.:

names(dfn) <- paste("V", seq.int(length(dfn)), sep = "")

That will reduce the overhead due to column names. Then what
as.data.frame does is simply call data.frame on the elements of the
list. That ensures that everything is consistent, but if you know for
sure that your list is valid (correct lengths, valid names, no need for
row names, etc.) then you can simply assert that it is a data frame:

class(dfn) <- "data.frame"
row.names(dfn) <- NULL
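
Putting the two steps together, a minimal sketch (sizes shrunk from the
original 4096 x 4096 for illustration; `.set_row_names` is base R's helper
for compact row names, used here to set the attribute directly since the
list starts out without one):

```r
# Build a list of equal-length numeric columns; small sizes for illustration.
n_rows <- 16
n_cols <- 8
dfn <- rep(list(rep(0, n_rows)), n_cols)

# Name the columns up front so R never has to invent names from the
# deparsed expression.
names(dfn) <- paste("V", seq.int(length(dfn)), sep = "")

# Assert data-frame-ness directly, skipping data.frame()'s per-column
# checks and copies.
class(dfn) <- "data.frame"
attr(dfn, "row.names") <- .set_row_names(n_rows)  # compact row names

stopifnot(is.data.frame(dfn), nrow(dfn) == n_rows, ncol(dfn) == n_cols)
```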

You'll still need double the memory because the object needs to be
copied for the attribute modifications, but that's as low as it gets --
although in your exact example there is an even more efficient way:

dfn <- rep(data.frame(X=rep(0, 4096)), 4096)
dfn <- do.call("cbind", dfn)

It uses only a fraction more memory than the size of the entire
object, but that's for entirely different reasons :). No, it's not
good in general :P
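
For completeness, the same trick at a reduced size. A caveat, hedging a
bit: rep() drops the data.frame class from its argument, so
do.call("cbind", ...) on the resulting plain list of numeric vectors
builds a numeric matrix rather than a data frame -- presumably part of
the "entirely different reasons" above:

```r
# Replicate a one-column data frame; rep() returns a plain list of n
# numeric vectors, all named "X".
n <- 16
dfn <- rep(data.frame(X = rep(0, n)), n)

# cbind the columns in one shot; with plain numeric vectors this yields
# a numeric matrix (wrap in as.data.frame() if a real data frame is
# needed, at the cost of another copy).
dfn <- do.call("cbind", dfn)

stopifnot(identical(dim(dfn), c(16L, 16L)))
```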

Cheers,
Simon
