[R] Significant performance difference between split of a data.frame and split of vectors

Wed Dec 9 22:15:55 CET 2009

On Wed, 9 Dec 2009, Peng Yu wrote:

> On Wed, Dec 9, 2009 at 11:20 AM, Charles C. Berry <cberry at tajo.ucsd.edu> wrote:
>> On Wed, 9 Dec 2009, Peng Yu wrote:
>>
>>> On Tue, Dec 8, 2009 at 11:06 PM, David Winsemius <dwinsemius at comcast.net>
>>> wrote:
>>>>
>>>> On Dec 9, 2009, at 12:00 AM, Peng Yu wrote:
>>>>
>>>>> On Tue, Dec 8, 2009 at 10:37 PM, David Winsemius
>>>>> <dwinsemius at comcast.net>
>>>>> wrote:
>>>>>>
>>>>>> On Dec 8, 2009, at 11:28 PM, Peng Yu wrote:
>>>>>>
>>>>>>> I have the following code, which tests split() on a data.frame and
>>>>>>> split() on each column (as a vector) separately. The runtimes differ
>>>>>>> by a factor of 10. When m and k increase, the difference becomes even
>>>>>>> bigger.
>>>>>>>
>>>>>>> I'm wondering why the performance on data.frame is so bad. Is it a bug
>>>>>>> in R? Can it be improved?
>>>>>>
>>>>>> You might want to look at the data.table package. The author claims
>>>>>> significant speed improvements over data.frames.
>>>>>
>>>>> This bug was found a long time ago and a package has been
>>>>> developed for it. Should the fix be integrated into data.frame rather
>>>>> than implemented in an additional package?
>>>>
>>>> What bug?
>>>
>>> Is the slow speed in splitting a data.frame a performance bug?
>>>
>>
>> NO!
>>
>> The two computations are not equivalent.
>>
>> One is a list whose elements are split vectors, and the other is a list of
>> data.frames containing those vectors.
>
> I made a comparable example below. Splitting a data.frame is still much
> slower than the second way that I'm showing.
>
>> If you take the trouble to assemble that list of data frames from the list
>> of split vectors you will see that it is very time consuming.
>
> It is not, as I show in the example below.

You are comparing creating a matrix to creating a data.frame.

> system.time(
+  spl<-   mapply(
+         function(...) {
+           cbind(...)
+         }
+         , v[[1]], v[[2]], v[[3]], v[[4]], v[[5]], v[[6]]
+         )
+     )
    user  system elapsed
   1.204   0.016   1.478

> system.time(
+  spl<-   mapply(
+         function(...) {
+           data.frame(...)
+         }
+         , v[[1]], v[[2]], v[[3]], v[[4]], v[[5]], v[[6]],SIMPLIFY=FALSE
+         )
+     )
    user  system elapsed
  56.088   0.104  56.478
>
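
For anyone who wants to see what the two calls return without waiting a minute, a small check along the following lines should do. This is only a sketch: the sizes (m, n, k) are made up so it runs instantly, the names as_mat and as_df are mine, and SIMPLIFY=FALSE is added to both calls so that each returns a plain list.

```r
# Small hypothetical sizes, not the ones timed above.
set.seed(0)
m <- 60; n <- 3; k <- 5
x <- replicate(n, rnorm(m))                  # m x n numeric matrix
f <- sample(1:k, size = m, replace = TRUE)   # grouping variable

# One list of split vectors per column, as in the original example
v <- lapply(seq_len(ncol(x)), function(i) split(x[, i], f))

# cbind() on the pieces builds one matrix per group
as_mat <- mapply(function(...) cbind(...),
                 v[[1]], v[[2]], v[[3]], SIMPLIFY = FALSE)

# data.frame() on the pieces builds one data.frame per group
as_df <- mapply(function(...) data.frame(...),
                v[[1]], v[[2]], v[[3]], SIMPLIFY = FALSE)

is.matrix(as_mat[[1]])       # TRUE
is.data.frame(as_df[[1]])    # TRUE
```

The two results hold the same numbers, but only the second carries the extra data.frame bookkeeping (class, row names, column names) for every group.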

If you just want a list of matrices, use

> system.time(split.data.frame(x,f))
    user  system elapsed
   0.524   0.016   0.927
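
A quick way to convince yourself that this returns matrices rather than data.frames is sketched below. The sizes are made up for illustration and the name pieces is mine. As far as I know, split.data.frame() just splits the row indices and subsets with x[ind, , drop = FALSE], which is why a matrix argument gives matrix pieces.

```r
# Small hypothetical problem, not the one timed above.
set.seed(0)
x <- replicate(3, rnorm(60))                 # 60 x 3 numeric matrix
f <- sample(1:5, size = 60, replace = TRUE)  # grouping variable

# Call the method directly on the matrix (no as.data.frame() first)
pieces <- split.data.frame(x, f)

is.matrix(pieces[[1]])                  # TRUE
sum(sapply(pieces, nrow)) == nrow(x)    # TRUE: every row lands in one group
```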

>
>> Read up on memory management issues. Think about what the computer actually
>> has to do in terms of memory access to split a data.frame versus split a
>> vector.
>
> I'd like to read more on how R does memory management. Would you please
> point me to a good source?

I see now that the timing issue was not one of memory, but of doing more 
work (see Rprof results below) to create a data.frame. But if you are 
interested you might look at

Golub, Gene H.; Van Loan, Charles F. (1996), Matrix Computations (3rd 
ed.), Johns Hopkins, ISBN 978-0-8018-5414-9 .

and/or Google "BLAS memory"

>
> But again, R is not user friendly. It took me quite a long time to
> figure out that splitting a data.frame is a bottleneck in my program
> and to reduce the problem to a test case.

See

 	?Rprof

and note where the 'self.time's are largest below (not in split or
split.data.frame):

> Rprof()
> res <- split(as.data.frame(x),f)
> Rprof(NULL)
> summaryRprof()
$by.self
                         self.time self.pct total.time total.pct
"attr"                      33.66     72.9      33.66      72.9
"[.data.frame"               3.26      7.1      45.70      98.9
"inherits"                   1.52      3.3       2.06       4.5
"anyDuplicated"              1.04      2.3       1.42       3.1
"[[.data.frame"              1.00      2.2       4.76      10.3
"[["                         0.74      1.6       5.50      11.9
"match"                      0.66      1.4       2.96       6.4
"<Anonymous>"                0.66      1.4       0.72       1.6
"sys.call"                   0.46      1.0       0.46       1.0
"all"                        0.38      0.8       0.38       0.8
"anyDuplicated.default"      0.36      0.8       0.38       0.8
"%in%"                       0.32      0.7       3.26       7.1
"names"                      0.26      0.6       0.26       0.6
"is.factor"                  0.24      0.5       2.30       5.0
"length"                     0.20      0.4       0.20       0.4
"attr<-"                     0.18      0.4       0.18       0.4
"as.character"               0.16      0.3       0.16       0.3
"["                          0.14      0.3      45.84      99.2
"-"                          0.14      0.3       0.14       0.3
"!"                          0.12      0.3       0.12       0.3
".Call"                      0.12      0.3       0.12       0.3
"!="                         0.10      0.2       0.10       0.2
"vector"                     0.06      0.1       0.26       0.6
"as.data.frame.matrix"       0.06      0.1       0.08       0.2
"|"                          0.06      0.1       0.06       0.1
"lapply"                     0.04      0.1      46.12      99.8
"<"                          0.04      0.1       0.04       0.1
"any"                        0.04      0.1       0.04       0.1
"is.na"                      0.04      0.1       0.04       0.1
".subset2"                   0.04      0.1       0.04       0.1
">"                          0.02      0.0       0.02       0.0
"as.vector"                  0.02      0.0       0.02       0.0
"dim"                        0.02      0.0       0.02       0.0
"is.matrix"                  0.02      0.0       0.02       0.0
"unique.default"             0.02      0.0       0.02       0.0
"split"                      0.00      0.0      46.20     100.0
"split.data.frame"           0.00      0.0      46.12      99.8
"FUN"                        0.00      0.0      45.84      99.2
"factor"                     0.00      0.0       0.24       0.5
"is.vector"                  0.00      0.0       0.24       0.5
"split.default"              0.00      0.0       0.24       0.5
"as.data.frame"              0.00      0.0       0.08       0.2
"unique"                     0.00      0.0       0.02       0.0

[output truncated]
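
If you want to try this on your own code, a self-contained sketch of the same workflow follows. The problem size is hypothetical (chosen only so the run is long enough to collect samples), and prof_file is just a name I picked for a temporary file so the profile does not land in the working directory.

```r
# Hypothetical sizes; adjust so the profiled call runs long enough
# for the sampling profiler to collect some samples.
set.seed(0)
x <- as.data.frame(replicate(6, rnorm(100000)))
f <- sample(1:10000, size = 100000, replace = TRUE)

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)   # start sampling the call stack
res <- split(x, f)                  # the code being profiled
Rprof(NULL)                         # stop profiling

prof <- summaryRprof(prof_file)
head(prof$by.self)    # ranked by time spent in each function's own code
head(prof$by.total)   # ranked by time including the functions it calls
```

The by.self table is the one to read first: a function such as split can show 100% total time and still 0% self time when all of the work happens in the functions it calls.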

Chuck

> I don't know how memory
> management is done in R, so I don't know if it is possible to fix
> the problem of splitting a data.frame without perturbing the
> interface of data.frame. But if splitting a data.frame is
> so slow, maybe it can be forbidden and an alternative can be
> documented somewhere.
>
>> ---
>>
>> And even if it were simply a matter of having code that is slow for some
>> application, that would not be a bug. Read the FAQ!
>
> The definition of a bug in the FAQ is narrower than what I thought.
> No matter what the definition of a bug is, split() on a data.frame is
> a perfectly legitimate operation (in terms of the interface). A quick fix
> for this problem is to at least single out the case where the argument
> is a data.frame, and to do what I have been doing below. That is why
> I say this is a performance bug. Similar cases, where a faster
> alternative exists but is not used, are fairly called bugs, at least
> in many other languages.
>
> > m=300000
> > n=6
> > k=30000
> >
> > set.seed(0)
> > x=replicate(n,rnorm(m))
> > f=sample(1:k, size=m, replace=T)
> >
> > system.time(split(as.data.frame(x),f))
>   user  system elapsed
> 39.020   0.010  39.084
> >
> > v=lapply(
> +     1:dim(x)[[2]]
> +     , function(i) {
> +       split(x[,i],f)
> +     }
> +     )
> >
> > system.time(lapply(
> +         1:dim(x)[[2]]
> +         , function(i) {
> +           split(x[,i],f)
> +         }
> +         )
> +     )
>   user  system elapsed
>  2.520   0.000   2.526
> >
> > system.time(
> +     mapply(
> +         function(...) {
> +           cbind(...)
> +         }
> +         , v[[1]], v[[2]], v[[3]], v[[4]], v[[5]], v[[6]]
> +         )
> +     )
>   user  system elapsed
>  0.920   0.000   0.927
> >
>
>
>>>>>
>>>>>> David.
>>>>>>>
>>>>>>> > system.time(split(as.data.frame(x),f))
>>>>>>>  user  system elapsed
>>>>>>>  1.700   0.010   1.786
>>>>>>> >
>>>>>>> > system.time(lapply(
>>>>>>> +         1:dim(x)[[2]]
>>>>>>> +         , function(i) {
>>>>>>> +           split(x[,i],f)
>>>>>>> +         }
>>>>>>> +         )
>>>>>>> +     )
>>>>>>>  user  system elapsed
>>>>>>>  0.170   0.000   0.167
>>>>>>>
>>>>>>> ###########
>>>>>>> m=30000
>>>>>>> n=6
>>>>>>> k=3000
>>>>>>>
>>>>>>> set.seed(0)
>>>>>>> x=replicate(n,rnorm(m))
>>>>>>> f=sample(1:k, size=m, replace=T)
>>>>>>>
>>>>>>> system.time(split(as.data.frame(x),f))
>>>>>>>
>>>>>>> system.time(lapply(
>>>>>>>      1:dim(x)[[2]]
>>>>>>>      , function(i) {
>>>>>>>        split(x[,i],f)
>>>>>>>      }
>>>>>>>      )
>>>>>>>  )
>>>>>>>
>>>>>>> ______________________________________________
>>>>>>> R-help at r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>> PLEASE do read the posting guide
>>>>>>> http://www.R-project.org/posting-guide.html
>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>
>>>>>> David Winsemius, MD
>>>>>> Heritage Laboratories
>>>>>> West Hartford, CT
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>> Charles C. Berry                            (858) 534-2098
>>                                             Dept of Family/Preventive Medicine
>> E mailto:cberry at tajo.ucsd.edu            UC San Diego
>> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
>>
>>
>
>

Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901