[R] problem with split eating giga-bytes of memory

Wed Dec 9 04:48:52 CET 2009

Hi Mark,

Why are you using factors?  I think for this case you might find
characters are faster and more space efficient.
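
As a rough illustration of the factor-versus-character point, here is a minimal sketch. It uses a scaled-down version of your example, and the names (n_levels, df_factor, df_char and so on) are made up for this illustration. The exact sizes will depend on your R version and data, so treat the comparison as indicative only.

```r
# Scaled-down variant of the example in the thread (hypothetical sizes).
n_levels <- 200
n_rows <- n_levels * 50

vals <- paste("The rain in Spain", seq_len(n_levels), sep = ".")
m <- matrix(vals, ncol = 7, nrow = n_rows)
key <- rep(paste("Rainy days and Mondays", seq_len(n_levels), sep = "."),
           length.out = n_rows)

# Factor columns: every piece of the split carries the full set of levels.
df_factor <- data.frame(m, stringsAsFactors = TRUE)
df_factor$key <- factor(key)

# Character columns: each piece only holds the strings it actually contains.
df_char <- data.frame(m, stringsAsFactors = FALSE)
df_char$key <- key

split_factor <- split(df_factor, df_factor$key)
split_char <- split(df_char, df_char$key)

size_factor <- as.numeric(object.size(split_factor))
size_char <- as.numeric(object.size(split_char))

# Ratio of reported sizes; values above 1 mean the factor version is larger.
size_factor / size_char
```

Note that object.size() counts the levels of every piece separately, so it may overstate what is really allocated, but it is the same measure used elsewhere in this thread.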

Alternatively, you can have a look at the plyr package which uses some
tricks to keep memory usage down.
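
If you would rather stay in base R, one way to keep memory down is to split the row numbers instead of the data frame and subset one group at a time. This is only a sketch of that general idea, not a description of what plyr does internally, and the data and names here (df, idx, group_means) are invented for the example.

```r
# Hypothetical small data set: 100 groups of 20 rows each.
n_groups <- 100
df <- data.frame(
  value = seq_len(n_groups * 20),
  key = factor(rep(paste("group", seq_len(n_groups), sep = "."), times = 20))
)

# Split the row indices rather than the rows themselves.
# The result is a list of small integer vectors, one per level of key.
idx <- split(seq_len(nrow(df)), df$key)

# Work on one group at a time, so the full set of pieces is never
# held in memory at once.
group_means <- vapply(idx, function(rows) mean(df$value[rows]), numeric(1))

head(group_means)
```

The same pattern works with df[rows, ] when a whole sub-data-frame is needed inside the function; each piece can then be discarded as soon as it has been processed.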

Hadley

On Tue, Dec 8, 2009 at 9:46 PM, Mark Kimpel <mwkimpel at gmail.com> wrote:
> Charles, I suspect you are correct regarding copying of the attributes.
> First off, selectSubAct.df is my "real" data, which turns out to be of the
> same dim() as myDataFrame below, but each column is made up of strings, not
> simple letters, and there are many levels in each column, which I did not
> properly duplicate in my first example. I have amended that below, and with
> the split the new object size is now not 10X the size of the original, but
> 100X. My "real" data is even more complex than this, so I suspect that is
> where the problem lies. I need to search for a better solution to my problem
> than split, for which I will start a separate thread if I can't figure
> something out.
>
> Thanks for pointing me in the right direction,
>
> Mark
>
> myDataFrame <- data.frame(matrix(paste("The rain in Spain",
> as.character(1:1400), sep = "."), ncol = 7, nrow = 399000))
> mySplitVar <- factor(paste("Rainy days and Mondays", as.character(1:1400),
> sep = "."))
> myDataFrame <- cbind(myDataFrame, mySplitVar)
> object.size(myDataFrame)
> ## 12860880 bytes # ~ 13MB
> myDataFrame.split <- split(myDataFrame, myDataFrame$mySplitVar)
> object.size(myDataFrame.split)
> ## 1,274,929,792 bytes ~ 1.2GB
> object.size(selectSubAct.df)
> ## 52,348,272 bytes # ~ 52MB
> Mark W. Kimpel MD  ** Neuroinformatics ** Dept. of Psychiatry
> Indiana University School of Medicine
>
> 15032 Hunter Court, Westfield, IN  46074
>
> (317) 490-5129 Work, & Mobile & VoiceMail
> (317) 399-1219 Skype No Voicemail please
>
>
> On Tue, Dec 8, 2009 at 10:22 PM, Charles C. Berry <cberry at tajo.ucsd.edu> wrote:
>
>> On Tue, 8 Dec 2009, Mark Kimpel wrote:
>>
>>> I'm having trouble using split on a very large data-set with ~1400 levels
>>> of the factor to be split. Unfortunately, I can't reproduce it with the
>>> simple self-contained example below. As you can see, splitting the
>>> artificial dataframe of size ~13MB results in a split dataframe of ~144MB,
>>> an increase in memory allocation of ~10 fold for the split object. If split
>>> scales linearly, then my actual 52MB dataframe should be easily handled by
>>> my 12GB of RAM, but it is not. Instead, when I try to split selectSubAct.df
>>> on one of its factors with 1473 levels, my memory is slowly gobbled up
>>> (plus 3 GB of swap) until I cancel the operation.
>>>
>>> Any ideas on what might be happening? Thanks, Mark
>>>
>>
>> Each element of myDataFrame.split contains a copy of the attributes of the
>> parent data.frame.
>>
>> And probably it does scale linearly. But the scaling factor depends on the
>> size of the attributes that get copied, I guess.
>>
>>
>>
>>
>>> myDataFrame <- data.frame(matrix(LETTERS, ncol = 7, nrow = 399000))
>>> mySplitVar <- factor(as.character(1:1400))
>>> myDataFrame <- cbind(myDataFrame, mySplitVar)
>>> object.size(myDataFrame)
>>> ## 12860880 bytes # ~ 13MB
>>> myDataFrame.split <- split(myDataFrame, myDataFrame$mySplitVar)
>>> object.size(myDataFrame.split)
>>> ## 144524992 bytes # ~ 144MB
>>>
>>
>> Note:
>>
>> > only.attr <- lapply(myDataFrame.split, function(x) sapply(x, attributes))
>> > (object.size(myDataFrame.split) - object.size(myDataFrame)) / object.size(only.attr)
>> 1.03726179240978 bytes
>>
>>
>>>
>>
>>> object.size(selectSubAct.df)
>>> ## 52,348,272 bytes # ~ 52MB
>>>
>>
>> What was this??
>>
>>
>> Chuck
>>
>>
>>>  sessionInfo()
>>>>
>>> R version 2.10.0 Patched (2009-10-27 r50222)
>>> x86_64-unknown-linux-gnu
>>>
>>> locale:
>>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>> [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices datasets  utils     methods   base
>>>
>>> loaded via a namespace (and not attached):
>>> [1] tools_2.10.0
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>> Charles C. Berry                            (858) 534-2098
>>                                            Dept of Family/Preventive
>> Medicine
>> E mailto:cberry at tajo.ucsd.edu               UC San Diego
>> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
>>
>>
>>

-- 
http://had.co.nz/