[R] problem with split eating giga-bytes of memory

Charles C. Berry cberry at tajo.ucsd.edu
Wed Dec 9 04:22:46 CET 2009


On Tue, 8 Dec 2009, Mark Kimpel wrote:

> I'm having trouble using split on a very large dataset with ~1400 levels of
> the factor to be split. Unfortunately, I can't reproduce it with the simple
> self-contained example below. As you can see, splitting the artificial
> data frame of size ~13MB results in a split object of ~144MB, a roughly
> 10-fold increase in memory allocation. If split scales
> linearly, then my actual 52MB data frame should be easily handled by my 12GB
> of RAM, but it is not. Instead, when I try to split selectSubAct.df on one
> of its factors with 1473 levels, my memory is slowly gobbled up (plus 3 GB
> of swap) until I cancel the operation.
>
> Any ideas on what might be happening? Thanks, Mark

Each element of myDataFrame.split contains a copy of the attributes of the
parent data.frame and of its columns (for a factor column, the full set of
levels).

So it probably does scale linearly, but I'd guess the scaling factor depends
on the size of the attributes that get copied, not just on the size of the
data.
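A minimal sketch of that effect, on a small stand-in data frame rather than the data from this thread (the names df, x, g and pieces and the sizes are made up for illustration): each piece has only two rows, but its split factor still carries all 1400 levels.

```r
# Small hypothetical stand-in for the data in this thread.
df <- data.frame(
  x = factor(sample(LETTERS, 2800, replace = TRUE)),
  g = factor(rep(as.character(1:1400), times = 2))
)
pieces <- split(df, df$g)

# Each piece holds 2 rows, yet its 'g' column keeps every level of the parent.
nrow(pieces[[1]])
nlevels(pieces[[1]]$g)

# object.size() counts those levels once per piece, so the split looks far
# larger than the data frame it came from.
object.size(df)
object.size(pieces)
```
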


>
> myDataFrame <- data.frame(matrix(LETTERS, ncol = 7, nrow = 399000))
> mySplitVar <- factor(as.character(1:1400))
> myDataFrame <- cbind(myDataFrame, mySplitVar)
> object.size(myDataFrame)
> ## 12860880 bytes # ~ 13MB
> myDataFrame.split <- split(myDataFrame, myDataFrame$mySplitVar)
> object.size(myDataFrame.split)
> ## 144524992 bytes # ~ 144MB

Note that the copied attributes account for nearly all of the growth (the
ratio below is close to 1):

> only.attr <- lapply(myDataFrame.split,function(x) sapply(x,attributes))
> (object.size(myDataFrame.split)-object.size(myDataFrame))/object.size(only.attr)
1.03726179240978 bytes
>
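The same bookkeeping can be turned around to estimate the overhead before splitting. This is only a rough sketch on a hypothetical stand-in data frame (df, x, g, attr_bytes, predicted and observed are illustrative names, not objects from this thread), and it assumes that every piece repeats every column's attributes:

```r
# Hypothetical stand-in; substitute the real data frame and split factor.
df <- data.frame(
  x = factor(sample(LETTERS, 2800, replace = TRUE)),
  g = factor(rep(as.character(1:1400), times = 2))
)

# Bytes of attributes (levels, class, ...) attached to each column.
attr_bytes <- sapply(df, function(col) as.numeric(object.size(attributes(col))))

# Rough prediction: each piece repeats every column's attributes.
n_pieces <- nlevels(df$g)
predicted <- n_pieces * sum(attr_bytes)

# Growth actually reported by object.size() after splitting.
observed <- as.numeric(object.size(split(df, df$g)) - object.size(df))

# Ratio of reported growth to the attribute-based prediction.
observed / predicted
```

Looking at attr_bytes column by column shows which columns dominate; a factor with many levels (or long level strings) will cost far more per piece than one with a handful.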


> object.size(selectSubAct.df)
> ## 52,348,272 bytes # ~ 52MB

What was this object? How it was built, and what attributes its columns
carry, would matter here.


Chuck

>
>> sessionInfo()
> R version 2.10.0 Patched (2009-10-27 r50222)
> x86_64-unknown-linux-gnu
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices datasets  utils     methods   base
>
> loaded via a namespace (and not attached):
> [1] tools_2.10.0
>
> Mark W. Kimpel MD  ** Neuroinformatics ** Dept. of Psychiatry
> Indiana University School of Medicine

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
