[R] Significant performance difference between split of a data.frame and split of vectors
Charles C. Berry
cberry at tajo.ucsd.edu
Wed Dec 9 18:20:42 CET 2009
On Wed, 9 Dec 2009, Peng Yu wrote:
> On Tue, Dec 8, 2009 at 11:06 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>>
>> On Dec 9, 2009, at 12:00 AM, Peng Yu wrote:
>>
>>> On Tue, Dec 8, 2009 at 10:37 PM, David Winsemius <dwinsemius at comcast.net>
>>> wrote:
>>>>
>>>> On Dec 8, 2009, at 11:28 PM, Peng Yu wrote:
>>>>
>>>>> I have the following code, which tests the split on a data.frame and
>>>>> the split on each column (as vector) separately. The runtimes differ
>>>>> by a factor of 10. When m and k increase, the difference becomes even
>>>>> bigger.
>>>>>
>>>>> I'm wondering why the performance on data.frame is so bad. Is it a bug
>>>>> in R? Can it be improved?
>>>>
>>>> You might want to look at the data.table package. The author claims
>>>> significant speed improvements over data.frames.
>>>
>>> This bug was found a long time ago and a package has been
>>> developed for it. Should the fix be integrated into data.frame rather
>>> than be implemented in an additional package?
>>
>> What bug?
>
> Is the slow speed in splitting a data.frame a performance bug?
>
NO!
The two computations are not equivalent.
One is a list whose elements are split vectors, and the other is a list of
data.frames containing those vectors.
If you take the trouble to assemble that list of data frames from the
list of split vectors you will see that it is very time consuming.
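A minimal sketch of that assembly, assuming x and f are built as in the quoted code further down, with smaller m and k (my choice) so that it finishes quickly. The names vec_splits, groups, df_list and direct are made up for illustration.

```r
set.seed(0)
m <- 3000; n <- 6; k <- 300   # smaller than in the quoted example (assumption)
x <- replicate(n, rnorm(m))
f <- sample(1:k, size = m, replace = TRUE)

# Step 1: the cheap part, one list of split vectors per column
system.time(
  vec_splits <- lapply(seq_len(ncol(x)), function(i) split(x[, i], f))
)

# Step 2: the extra work, gluing the pieces into one data.frame per group
groups <- names(vec_splits[[1]])
system.time({
  df_list <- lapply(groups, function(g) {
    cols <- lapply(vec_splits, function(s) s[[g]])
    names(cols) <- paste0("V", seq_along(cols))
    as.data.frame(cols)
  })
  names(df_list) <- groups
})

# For comparison: split() applied to the data.frame directly
system.time(direct <- split(as.data.frame(x), f))
```

Only after step 2 do you hold the same kind of object that split() returns for a data.frame, and that step is where most of the time should go.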
Read up on memory management issues. Think about what the computer
actually has to do in terms of memory access to split a data.frame versus
split a vector.
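As a rough illustration of the extra work, here is a sketch of splitting a data.frame by hand: split the row indices, then extract the rows for each group. This is along the lines of what the data.frame method of split() does, though the exact implementation may differ between R versions. The names df, idx and manual are made up for the example.

```r
set.seed(0)
df <- as.data.frame(replicate(6, rnorm(3000)))
f  <- sample(1:300, size = 3000, replace = TRUE)

# Splitting the row indices is a single, cheap vector split ...
idx <- split(seq_len(nrow(df)), f)

# ... but every group then needs its own data.frame. Each call to `[`
# copies the selected elements of every column and builds new row names
# and attributes, so the cost grows with the number of groups.
manual <- lapply(idx, function(i) df[i, , drop = FALSE])

# Check that the result agrees with split() on the data.frame
all.equal(manual, split(df, f))
```

With 300 groups that is 300 row-subsetting calls on a data.frame, versus six vector splits in the column-wise version.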
---
And even if it were simply a matter of having code that is slow for some
application, that would not be a bug. Read the FAQ!
Chuck
>>>
>>>> David.
>>>>>
>>>>> > system.time(split(as.data.frame(x),f))
>>>>>    user  system elapsed
>>>>>   1.700   0.010   1.786
>>>>> > system.time(lapply(
>>>>> + 1:dim(x)[[2]]
>>>>> + , function(i) {
>>>>> + split(x[,i],f)
>>>>> + }
>>>>> + )
>>>>> + )
>>>>>    user  system elapsed
>>>>>   0.170   0.000   0.167
>>>>>
>>>>> ###########
>>>>> m=30000
>>>>> n=6
>>>>> k=3000
>>>>>
>>>>> set.seed(0)
>>>>> x=replicate(n,rnorm(m))
>>>>> f=sample(1:k, size=m, replace=TRUE)
>>>>>
>>>>> system.time(split(as.data.frame(x),f))
>>>>>
>>>>> system.time(lapply(
>>>>> 1:dim(x)[[2]]
>>>>> , function(i) {
>>>>> split(x[,i],f)
>>>>> }
>>>>> )
>>>>> )
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>> David Winsemius, MD
>>>> Heritage Laboratories
>>>> West Hartford, CT
>>>>
>>>>
Charles C. Berry (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901