[R] Significant performance difference between split of a data.frame and split of vectors

Peng Yu pengyu.ut at gmail.com
Wed Dec 9 18:07:58 CET 2009


On Tue, Dec 8, 2009 at 11:06 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Dec 9, 2009, at 12:00 AM, Peng Yu wrote:
>
>> On Tue, Dec 8, 2009 at 10:37 PM, David Winsemius <dwinsemius at comcast.net>
>> wrote:
>>>
>>> On Dec 8, 2009, at 11:28 PM, Peng Yu wrote:
>>>
>>>> I have the following code, which tests the split on a data.frame and
>>>> the split on each column (as a vector) separately. The runtimes differ
>>>> by about a factor of 10. When m and k increase, the difference becomes
>>>> even bigger.
>>>>
>>>> I'm wondering why the performance on data.frame is so bad. Is it a bug
>>>> in R? Can it be improved?
>>>
>>> You might want to look at the data.table package. The author claims
>>> significant speed improvements over data.frames.
>>
>> This bug was found a long time ago and a package has been
>> developed for it. Should the fix be integrated into data.frame rather
>> than implemented in an additional package?
>
> What bug?

Is the slow speed in splitting a data.frame a performance bug?
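
To make the comparison concrete, here is a minimal sketch, at reduced sizes, that checks the two approaches in the quoted test produce the same numbers and differ only in how they are packaged. The sizes and the names df, by_df, by_col and g are assumptions chosen for illustration. As far as I understand, split() on a data.frame splits the row indices and then subsets the data.frame once per group, so it builds k small data.frames, while the per-column version only partitions plain numeric vectors.

```r
# Reduced sizes (assumed for illustration; the quoted test uses m=30000, k=3000)
m <- 3000   # number of rows
n <- 6      # number of columns
k <- 300    # number of groups

set.seed(0)
x <- replicate(n, rnorm(m))                  # m-by-n numeric matrix
f <- sample(1:k, size = m, replace = TRUE)   # group label for each row

df <- as.data.frame(x)

# Approach 1: one split of the data.frame, giving a list of small data.frames
by_df <- split(df, f)

# Approach 2: one split per column, giving a list (per column) of lists (per group)
by_col <- lapply(seq_len(ncol(x)), function(i) split(x[, i], f))

# Pick the first group that actually occurs and compare column 2 both ways
g <- names(by_df)[1]
stopifnot(identical(by_df[[g]][[2]], by_col[[2]][[g]]))
```

Wrapping either call in system.time() at the original sizes reproduces the kind of gap shown in the quoted output; the exact timings will depend on the machine and the R version.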

>>
>>> David.
>>>>
>>>> > system.time(split(as.data.frame(x),f))
>>>>    user  system elapsed
>>>>   1.700   0.010   1.786
>>>> > system.time(lapply(
>>>> +         1:dim(x)[[2]]
>>>> +         , function(i) {
>>>> +           split(x[,i],f)
>>>> +         }
>>>> +         )
>>>> +     )
>>>>    user  system elapsed
>>>>   0.170   0.000   0.167
>>>>
>>>> ###########
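>>>> m=30000   # number of rows
>>>> n=6       # number of columns
>>>> k=3000    # number of groups
>>>>
>>>> set.seed(0)
>>>> x=replicate(n,rnorm(m))   # m-by-n numeric matrix
>>>> f=sample(1:k, size=m, replace=TRUE)   # group label for each row
>>>>
>>>> # split the whole data.frame at once
>>>> system.time(split(as.data.frame(x),f))
>>>>
>>>> # split each column separately, as a vector
>>>> system.time(lapply(
>>>>      1:dim(x)[[2]]
>>>>      , function(i) {
>>>>        split(x[,i],f)
>>>>      }
>>>>      )
>>>>  )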
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>> David Winsemius, MD
>>> Heritage Laboratories
>>> West Hartford, CT
>>>
>>>
>>



