[R] Significant performance difference between split of a data.frame and split of vectors

Peng Yu pengyu.ut at gmail.com
Wed Dec 9 20:59:49 CET 2009


On Tue, Dec 8, 2009 at 10:37 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Dec 8, 2009, at 11:28 PM, Peng Yu wrote:
>
>> I have the following code, which tests split on a data.frame and
>> split on each column (as a vector) separately. The runtimes differ
>> by a factor of 10. As m and k increase, the difference becomes even
>> bigger.
>>
>> I'm wondering why the performance on data.frame is so bad. Is it a bug
>> in R? Can it be improved?
>
> You might want to look at the data.table package. The author claims
> significant speed improvements over data.frames

'data.table' doesn't seem to help. You can also try the other set of
m, n, k (commented out below). In both cases, using as.data.frame is
faster than using as.data.table.

Please let me know if I have understood what you meant.

> m=10
> n=6
> k=3
>
> #m=300000
> #n=6
> #k=30000
>
> set.seed(0)
> x=replicate(n,rnorm(m))
> f=sample(1:k, size=m, replace=T)
>
> library(data.table)
Loading required package: ref
dim(refdata) and dimnames(refdata) no longer allow parameter ref=TRUE,
use dim(derefdata(refdata)), dimnames(derefdata(refdata)) instead
> system.time(split(as.data.frame(x),f))
   user  system elapsed
  0.000   0.000   0.003
> system.time(split(as.data.table(x),f))
   user  system elapsed
  0.010   0.000   0.011

>>> system.time(split(as.data.frame(x),f))
>>
>>  user  system elapsed
>>  1.700   0.010   1.786
>>>
>>> system.time(lapply(
>>
>> +         1:dim(x)[[2]]
>> +         , function(i) {
>> +           split(x[,i],f)
>> +         }
>> +         )
>> +     )
>>  user  system elapsed
>>  0.170   0.000   0.167
>>
>> ###########
>> m=30000
>> n=6
>> k=3000
>>
>> set.seed(0)
>> x=replicate(n,rnorm(m))
>> f=sample(1:k, size=m, replace=T)
>>
>> system.time(split(as.data.frame(x),f))
>>
>> system.time(lapply(
>>       1:dim(x)[[2]]
>>       , function(i) {
>>         split(x[,i],f)
>>       }
>>       )
>>   )
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius, MD
> Heritage Laboratories
> West Hartford, CT
>
>