[R] Significant performance difference between split of a data.frame and split of vectors

Peng Yu pengyu.ut at gmail.com
Wed Dec 9 18:56:11 CET 2009


On Wed, Dec 9, 2009 at 11:20 AM, Charles C. Berry <cberry at tajo.ucsd.edu> wrote:
> On Wed, 9 Dec 2009, Peng Yu wrote:
>
>> On Tue, Dec 8, 2009 at 11:06 PM, David Winsemius <dwinsemius at comcast.net>
>> wrote:
>>>
>>> On Dec 9, 2009, at 12:00 AM, Peng Yu wrote:
>>>
>>>> On Tue, Dec 8, 2009 at 10:37 PM, David Winsemius
>>>> <dwinsemius at comcast.net>
>>>> wrote:
>>>>>
>>>>> On Dec 8, 2009, at 11:28 PM, Peng Yu wrote:
>>>>>
>>>>>> I have the following code, which tests split() on a data.frame and
>>>>>> split() on each column (as a vector) separately. The runtimes differ
>>>>>> by a factor of about 10. As m and k increase, the difference becomes
>>>>>> even bigger.
>>>>>>
>>>>>> I'm wondering why the performance on a data.frame is so bad. Is it a
>>>>>> bug in R? Can it be improved?
>>>>>
>>>>> You might want to look at the data.table package. The author claims
>>>>> significant speed improvements over data.frames.
>>>>
>>>> This bug was found a long time ago, and a package has been developed
>>>> to work around it. Shouldn't the fix be integrated into data.frame
>>>> rather than implemented in a separate package?
>>>
>>> What bug?
>>
>> Is the slow speed in splitting a data.frame a performance bug?
>>
>
> NO!
>
> The two computations are not equivalent.
>
> One is a list whose elements are split vectors, and the other is a list of
> data.frames containing those vectors.
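Right, the two results have different structures. A toy check (with small sizes of my own choosing, not the benchmark sizes below) makes the difference concrete:

```r
# Toy sizes, just to compare the two result structures.
set.seed(0)
x <- replicate(2, rnorm(6))      # 6 x 2 numeric matrix
f <- c(1, 1, 2, 2, 3, 3)         # grouping factor, 3 groups

a <- split(as.data.frame(x), f)                 # a list of 3 data.frames
b <- lapply(1:2, function(i) split(x[, i], f))  # a list of 2 lists of vectors

# Same numbers, different containers.
stopifnot(is.data.frame(a[[1]]), is.numeric(b[[1]][[1]]))
stopifnot(identical(a[[1]]$V1, b[[1]][[1]]))
```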

I made a comparable example below. Splitting the data.frame is still much
slower compared with the second approach I show.

> If you take the trouble to assemble that list of data frames from the list
> of split vectors you will see that it is very time consuming.

It is not, as I show in the example below.

> Read up on memory management issues. Think about what the computer actually
> has to do in terms of memory access to split a data.frame versus split a
> vector.

I'd like to read more on how R does memory management. Could you please
point me to a good source?

But again, R is not user friendly here. It took me quite a long time to
figure out that splitting a data.frame was the bottleneck in my program
and to reduce the problem to a test case. I don't know how memory
management is done in R, so I can't tell whether the performance of
splitting a data.frame could be fixed without perturbing the data.frame
interface. But if splitting a data.frame is this slow, maybe the
operation should at least be discouraged and a faster alternative
documented somewhere.

> ---
>
> And even if it were simply a matter of having code that is slow for some
> application, that would not be a bug. Read the FAQ!

The definition of a bug in the FAQ is narrower than what I had thought.
But whatever the definition, split() on a data.frame is a perfectly
legitimate operation in terms of the interface. A quick fix to this
problem would be to at least single out the case where the argument is a
data.frame and do what I have been doing below. That is why I call this
a performance bug. Similar cases, where a faster alternative exists but
is not used, would be called bugs in many other languages.

> m=300000
> n=6
> k=30000
>
> set.seed(0)
> x=replicate(n,rnorm(m))
> f=sample(1:k, size=m, replace=T)
>
> system.time(split(as.data.frame(x),f))
   user  system elapsed
 39.020   0.010  39.084
>
> v=lapply(
+     1:dim(x)[[2]]
+     , function(i) {
+       split(x[,i],f)
+     }
+     )
>
> system.time(lapply(
+         1:dim(x)[[2]]
+         , function(i) {
+           split(x[,i],f)
+         }
+         )
+     )
   user  system elapsed
  2.520   0.000   2.526
>
> system.time(
+     mapply(
+         function(...) {
+           cbind(...)
+         }
+         , v[[1]], v[[2]], v[[3]], v[[4]], v[[5]], v[[6]]
+         )
+     )
   user  system elapsed
  0.920   0.000   0.927
>
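For the record, here is a toy-sized version of the comparison above (sizes shrunk so it runs instantly), checking that the per-column split plus cbind() reproduces the same numbers as split() on the data.frame, only in matrix rather than data.frame containers:

```r
# Shrunken sizes; same logic as the timed session above.
m <- 1000; n <- 6; k <- 50
set.seed(0)
x <- replicate(n, rnorm(m))
f <- sample(1:k, size = m, replace = TRUE)

# Slow path: split() dispatches to split.data.frame(), which ends up
# calling `[.data.frame` once per group.
by.df <- split(as.data.frame(x), f)

# Fast path: split each column vector, then stitch the pieces back.
v <- lapply(seq_len(ncol(x)), function(i) split(x[, i], f))
by.vec <- mapply(cbind, v[[1]], v[[2]], v[[3]], v[[4]], v[[5]], v[[6]],
                 SIMPLIFY = FALSE)

# Same values in every group; only the container type differs.
stopifnot(length(by.df) == length(by.vec))
stopifnot(isTRUE(all.equal(unname(as.matrix(by.df[[1]])),
                           unname(by.vec[[1]]))))
```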


>>>>
>>>>> David.
>>>>>>
>>>>>>> system.time(split(as.data.frame(x),f))
>>>>>>
>>>>>>  user  system elapsed
>>>>>>  1.700   0.010   1.786
>>>>>>>
>>>>>>> system.time(lapply(
>>>>>>
>>>>>> +         1:dim(x)[[2]]
>>>>>> +         , function(i) {
>>>>>> +           split(x[,i],f)
>>>>>> +         }
>>>>>> +         )
>>>>>> +     )
>>>>>>  user  system elapsed
>>>>>>  0.170   0.000   0.167
>>>>>>
>>>>>> ###########
>>>>>> m=30000
>>>>>> n=6
>>>>>> k=3000
>>>>>>
>>>>>> set.seed(0)
>>>>>> x=replicate(n,rnorm(m))
>>>>>> f=sample(1:k, size=m, replace=T)
>>>>>>
>>>>>> system.time(split(as.data.frame(x),f))
>>>>>>
>>>>>> system.time(lapply(
>>>>>>      1:dim(x)[[2]]
>>>>>>      , function(i) {
>>>>>>        split(x[,i],f)
>>>>>>      }
>>>>>>      )
>>>>>>  )
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>> David Winsemius, MD
>>>>> Heritage Laboratories
>>>>> West Hartford, CT
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
> Charles C. Berry                            (858) 534-2098
>                                            Dept of Family/Preventive
> Medicine
> E mailto:cberry at tajo.ucsd.edu               UC San Diego
> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
>
>



