[R] Significant performance difference between split of a data.frame and split of vectors
Peng Yu
pengyu.ut at gmail.com
Wed Dec 9 06:00:28 CET 2009
On Tue, Dec 8, 2009 at 10:37 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Dec 8, 2009, at 11:28 PM, Peng Yu wrote:
>
>> I have the following code, which tests split on a data.frame and
>> split on each column (as a vector) separately. The runtimes differ by
>> a factor of 10. When m and k increase, the difference becomes even
>> bigger.
>>
>> I'm wondering why the performance on data.frame is so bad. Is it a bug
>> in R? Can it be improved?
>
> You might want to look at the data.table package. The author claims
> significant speed improvements over data.frames.
So this bug was found a long time ago and a package has been
developed to work around it. Shouldn't the fix be integrated into
data.frame rather than implemented in an additional package?
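
For reference, here is a rough sketch of how the same split might look
with data.table. It assumes the package is installed (the code is skipped
otherwise) and uses smaller sizes than my original test; the names dt and
parts_dt are only for illustration. I have not timed it, so I am making no
claim about the speedup.

```r
# Smaller sizes than the original test so this runs quickly.
m <- 3000; n <- 6; k <- 300
set.seed(0)
x <- replicate(n, rnorm(m))
f <- sample(1:k, size = m, replace = TRUE)

# Assumption: data.table is installed; nothing is done if it is not.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(x)        # matrix columns become V1..V6
  data.table::set(dt, j = "f", value = f)   # add the grouping column by reference
  parts_dt <- split(dt, by = "f")           # one data.table per value of f
}
```

Each element of parts_dt should hold the rows for one value of f.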
> David.
>>
>> > system.time(split(as.data.frame(x),f))
>>    user  system elapsed
>>   1.700   0.010   1.786
>> > system.time(lapply(
>> + 1:dim(x)[[2]]
>> + , function(i) {
>> + split(x[,i],f)
>> + }
>> + )
>> + )
>>    user  system elapsed
>>   0.170   0.000   0.167
>>
>> ###########
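>> m=30000  # number of rows
>> n=6      # number of columns
>> k=3000   # number of possible groups
>>
>> set.seed(0)
>> x=replicate(n,rnorm(m))
>> f=sample(1:k, size=m, replace=TRUE)
>>
>> # split the whole data.frame at once
>> system.time(split(as.data.frame(x),f))
>>
>> # split each column (as a vector) separately
>> system.time(lapply(
>> 1:dim(x)[[2]]
>> , function(i) {
>> split(x[,i],f)
>> }
>> )
>> )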
>>
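
As far as I can tell from the source of split.data.frame, it splits the
row indices and then subsets the data.frame once per group, so the cost
seems to come from calling `[.data.frame` k times. A minimal base-R sketch
of a workaround, assuming all columns are numeric so the data can stay a
matrix (smaller sizes than above; idx, parts and ref are names I made up):

```r
# Smaller sizes than the original test so this runs quickly.
m <- 3000; n <- 6; k <- 300
set.seed(0)
x <- replicate(n, rnorm(m))
f <- sample(1:k, size = m, replace = TRUE)

# Split the row indices once instead of splitting the data itself.
idx <- split(seq_len(nrow(x)), f)

# Subset the matrix per group; matrix indexing avoids `[.data.frame`.
parts <- lapply(idx, function(i) x[i, , drop = FALSE])

# Check against split() on the data.frame: same groups, same values.
ref <- split(as.data.frame(x), f)
same <- all(mapply(function(a, b) all(a == as.matrix(b)), parts, ref))
```

Here parts is a list of matrices rather than data.frames, which is the
trade-off: it only works when every column has the same type.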
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius, MD
> Heritage Laboratories
> West Hartford, CT
>
>