[R] Significant performance difference between split of a data.frame and split of vectors

David Winsemius dwinsemius at comcast.net
Wed Dec 9 05:37:37 CET 2009


On Dec 8, 2009, at 11:28 PM, Peng Yu wrote:

> I have the following code, which tests the split on a data.frame and
> the split on each column (as vector) separately. The runtimes differ by
> a factor of 10. When m and k increase, the difference becomes even
> bigger.
>
> I'm wondering why the performance on data.frame is so bad. Is it a bug
> in R? Can it be improved?

You might want to look at the data.table package. The author claims
significant speed improvements over data.frames.
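
For what it's worth, here is a minimal sketch of how that might look on
your test data. It assumes the data.table package is installed; the
column name grp and the object names DT, res and parts are mine, chosen
only for illustration, and I make no claim about the timings you will
see. Recent versions of data.table also provide a split method that
takes a by argument, which is the closest analogue to your call.

```r
library(data.table)

# Same test data as in the original post
m <- 30000
n <- 6
k <- 3000

set.seed(0)
x <- replicate(n, rnorm(m))
f <- sample(1:k, size = m, replace = TRUE)

# Convert the matrix to a data.table (columns are named V1..V6)
# and attach the grouping factor as a column (name "grp" is illustrative)
DT <- as.data.table(x)
DT[, grp := f]

# Grouped computation without building k separate data.frames:
# the mean of every column within each group
system.time(res <- DT[, lapply(.SD, mean), by = grp])

# If the pieces themselves are needed, split by the grouping column
# (assumes a data.table version that ships split() with a 'by' argument)
system.time(parts <- split(DT, by = "grp"))
```

If you only need per-group summaries, the grouped form avoids creating
the list of pieces at all, which may be where most of the time goes.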

-- 
David.
>
>> system.time(split(as.data.frame(x),f))
>   user  system elapsed
>  1.700   0.010   1.786
>>
>> system.time(lapply(
> +         1:dim(x)[[2]]
> +         , function(i) {
> +           split(x[,i],f)
> +         }
> +         )
> +     )
>   user  system elapsed
>  0.170   0.000   0.167
>
> ###########
> m=30000
> n=6
> k=3000
>
> set.seed(0)
> x=replicate(n,rnorm(m))
> f=sample(1:k, size=m, replace=T)
>
> system.time(split(as.data.frame(x),f))
>
> system.time(lapply(
>        1:dim(x)[[2]]
>        , function(i) {
>          split(x[,i],f)
>        }
>        )
>    )
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
Heritage Laboratories
West Hartford, CT



