[R] Significant performance difference between split of a data.frame and split of vectors

Peng Yu pengyu.ut at gmail.com
Wed Dec 9 06:00:28 CET 2009


On Tue, Dec 8, 2009 at 10:37 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Dec 8, 2009, at 11:28 PM, Peng Yu wrote:
>
>> I have the following code, which tests the split on a data.frame and
>> the split on each column (as a vector) separately. The runtimes differ
>> by a factor of 10. When m and k increase, the difference becomes even
>> bigger.
>>
>> I'm wondering why the performance on data.frame is so bad. Is it a bug
>> in R? Can it be improved?
>
> You might want to look at the data.table package. The author claims
> significant speed improvements over data.frames.

This bug was found a long time ago, and a package has been
developed for it. Should the fix be integrated into data.frame rather
than implemented in an additional package?
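
In the meantime, here is a minimal sketch of the workaround implied by the timings quoted below: split every column as a plain vector and only rebuild a data.frame for a group when it is actually needed. The sizes are smaller than in the original test so it runs quickly, and the names by_group, by_column and group_frame are made up for illustration; they are not part of base R or data.table.

```r
# Illustrative sizes, smaller than the m, n, k used in the original test.
set.seed(0)
m <- 3000
n <- 6
k <- 300
x <- replicate(n, rnorm(m))
f <- sample(1:k, size = m, replace = TRUE)
df <- as.data.frame(x)

# Reference result: one data.frame per level of f.
by_group <- split(df, f)

# Alternative: split each column as a plain vector.
# by_column[[j]][[g]] holds the values of column j that fall in group g.
by_column <- lapply(df, split, f)

# Hypothetical helper: rebuild the data.frame for a single group on demand.
group_frame <- function(by_column, g) {
  as.data.frame(lapply(by_column, `[[`, g))
}

# Compare the two approaches; timings depend on the machine and R version.
print(system.time(split(df, f)))
print(system.time(lapply(df, split, f)))
```

The layout is transposed relative to split(df, f): the outer list is indexed by column and the inner one by group. Whether that is acceptable depends on how the pieces are consumed afterwards, and the rebuilt data.frame does not keep the original row names.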

> David.
>>
>> > system.time(split(as.data.frame(x),f))
>>  user  system elapsed
>>  1.700   0.010   1.786
>> > system.time(lapply(
>> +         1:dim(x)[[2]]
>> +         , function(i) {
>> +           split(x[,i],f)
>> +         }
>> +         )
>> +     )
>>  user  system elapsed
>>  0.170   0.000   0.167
>>
>> ###########
>> m=30000
>> n=6
>> k=3000
>>
>> set.seed(0)
>> x=replicate(n,rnorm(m))
>> f=sample(1:k, size=m, replace=TRUE)
>>
>> system.time(split(as.data.frame(x),f))
>>
>> system.time(lapply(
>>       1:dim(x)[[2]]
>>       , function(i) {
>>         split(x[,i],f)
>>       }
>>       )
>>   )
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius, MD
> Heritage Laboratories
> West Hartford, CT
>
>


