[Rd] as.data.frame requires a lot of memory (PR#14140)
Simon Urbanek
simon.urbanek at r-project.org
Mon Dec 14 23:01:19 CET 2009
On Dec 14, 2009, at 12:45 , rfalke at tzi.de wrote:
> Full_Name: Raimar Falke
> Version: R version 2.10.0 (2009-10-26)
> OS: Linux 2.6.27-16-generic #1 SMP Tue Dec 1 19:26:23 UTC 2009
> x86_64 GNU/Linux
> Submission from: (NULL) (134.102.222.56)
>
>
> The construction of a data frame in the way shown below requires
> much more memory than expected. If we assume a cell value takes 8
> bytes, the total amount of data is 128 MB. However, the process takes
> about 920 MB and not the expected 256 MB (two times the data set).
>
> With the real data sets (~35000 observations with ~33000 attributes),
> the conversion to a data frame has to be killed at 60 GB of memory
> usage, while it should only require 17.6 GB (2 * 8.8 GB).
>
> dfn <- rep(list(rep(0, 4096)), 4096)
> test <- as.data.frame.list(dfn)
>
> I also tried the incremental construction of the
> data-frame: df$colN <- dataForColN. While I currently can't say much
> about the memory usage, it takes a looong time.
>
> After the construction the saved-and-loaded data-frame has the
> expected size.
>
> What is the recommended way to construct larger data-frames?
>
Please use R-help for questions, and not the bug tracking system!
There are a few issues with your example, mainly because it has no row
names and no column names, so R will try to create them from the
expression, which is inherently slow and memory-consuming. So first,
make sure you set the names on the list, e.g.:
names(dfn) <- paste("V",seq.int(length(dfn)),sep='')
That will reduce the overhead due to column names. Then what
as.data.frame does is to simply call data.frame on the elements of the
list. That ensures that all is right, but if you know for sure that
your list is valid (correct lengths, valid names, no need for row
names etc.) then you can simply assert that it is a data frame:
class(dfn)<-"data.frame"
row.names(dfn)<-NULL
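The two steps above (name the columns, then assert the class) can be sketched as one small helper. This is only an illustration under stated assumptions: `list_as_df` is a hypothetical name, not a base R function, and the sketch sets the compact row names explicitly through the "row.names" attribute so that the row count is taken from the column length rather than inferred. The list is shrunk to 8 x 4 so the result is easy to inspect.

```r
# Hypothetical helper (not part of base R): treat an already-valid list as a
# data frame without going through data.frame().
list_as_df <- function(x) {
  stopifnot(is.list(x), length(x) > 0L)
  n <- length(x[[1L]])
  # The "valid list" assumption: every column has the same length.
  stopifnot(all(sapply(x, length) == n))
  # Supply column names up front if the list has none.
  if (is.null(names(x))) names(x) <- paste("V", seq_along(x), sep = "")
  # Compact automatic row names 1..n, set explicitly from the column length.
  attr(x, "row.names") <- .set_row_names(n)
  class(x) <- "data.frame"
  x
}

dfn <- rep(list(rep(0, 8)), 4)   # small stand-in for the 4096 x 4096 case
df <- list_as_df(dfn)
dim(df)                          # rows, columns
```

None of the checks that data.frame() normally performs (recycling, factor conversion, name repair) happen here, so this only makes sense when the list is known to be well-formed.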
You'll still need double the memory because the object needs to be
copied for the attribute modifications, but that's as low as it gets.
In your exact example, though, there is an even more efficient way:
dfn <- rep(data.frame(X=rep(0, 4096)), 4096)
dfn <- do.call("cbind", dfn)
It uses only a fraction more memory than the size of the entire
object, but that's for entirely different reasons :). No, it's not
good in general :P
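One plausible reading of the "entirely different reasons" (my interpretation, not something spelled out above): rep() on a data frame returns a plain list of its columns, so cbind is handed numeric vectors and builds a numeric matrix rather than a data frame. A small sketch with 8 x 4 instead of 4096 x 4096 makes this easy to check:

```r
col <- data.frame(X = rep(0, 8))
parts <- rep(col, 4)            # plain list of 4 numeric vectors
m <- do.call("cbind", parts)    # cbind on vectors builds a matrix

is.data.frame(parts)            # the replicated object is no longer a data frame
is.matrix(m)                    # and the result is a matrix
dim(m)
```

A matrix is a single contiguous vector with a dim attribute, which would explain the low overhead, but it only works when every column has the same type, hence "not good in general".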
Cheers,
Simon