[R] Significant performance difference between split of a data.frame and split of vectors
Charles C. Berry
cberry at tajo.ucsd.edu
Wed Dec 9 22:15:55 CET 2009
On Wed, 9 Dec 2009, Peng Yu wrote:
> On Wed, Dec 9, 2009 at 11:20 AM, Charles C. Berry <cberry at tajo.ucsd.edu> wrote:
>> On Wed, 9 Dec 2009, Peng Yu wrote:
>>
>>> On Tue, Dec 8, 2009 at 11:06 PM, David Winsemius <dwinsemius at comcast.net>
>>> wrote:
>>>>
>>>> On Dec 9, 2009, at 12:00 AM, Peng Yu wrote:
>>>>
>>>>> On Tue, Dec 8, 2009 at 10:37 PM, David Winsemius
>>>>> <dwinsemius at comcast.net>
>>>>> wrote:
>>>>>>
>>>>>> On Dec 8, 2009, at 11:28 PM, Peng Yu wrote:
>>>>>>
>>>>>>> I have the following code, which tests split() on a data.frame and
>>>>>>> split() on each column (as a vector) separately. The runtimes differ
>>>>>>> by a factor of 10. When m and k increase, the difference becomes even
>>>>>>> bigger.
>>>>>>>
>>>>>>> I'm wondering why the performance on data.frame is so bad. Is it a bug
>>>>>>> in R? Can it be improved?
>>>>>>
>>>>>> You might want to look at the data.table package. The author claims
>>>>>> significant speed improvements over data.frames.
>>>>>
>>>>> This bug was found a long time ago and a package has been
>>>>> developed for it. Should the fix be integrated into data.frame rather
>>>>> than implemented in an additional package?
>>>>
>>>> What bug?
>>>
>>> Is the slow speed in splitting a data.frame a performance bug?
>>>
>>
>> NO!
>>
>> The two computations are not equivalent.
>>
>> One is a list whose elements are split vectors, and the other is a list of
>> data.frames containing those vectors.
>
> I made a comparable example below. Splitting a data.frame is still much
> slower compared with the second way that I'm showing.
>
>> If you take the trouble to assemble that list of data frames from the list
>> of split vectors you will see that it is very time consuming.
>
> It is not as I show in the example below.
You are comparing creating a matrix to creating a data.frame.
> system.time(
+ spl<- mapply(
+ function(...) {
+ cbind(...)
+ }
+ , v[[1]], v[[2]], v[[3]], v[[4]], v[[5]], v[[6]]
+ )
+ )
user system elapsed
1.204 0.016 1.478
> system.time(
+ spl<- mapply(
+ function(...) {
+ data.frame(...)
+ }
+ , v[[1]], v[[2]], v[[3]], v[[4]], v[[5]], v[[6]],SIMPLIFY=FALSE
+ )
+ )
user system elapsed
56.088 0.104 56.478
>
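To make the distinction concrete, here is a minimal sketch on a much smaller problem. The sizes and the names xs, fs, vs, as_mat and as_df are illustrative assumptions, not part of the code quoted in this thread. It only checks that the cbind() route yields matrices while the data.frame() route yields data frames holding the same numbers; it says nothing about timing.

```r
set.seed(0)
m_small <- 200   # rows (illustrative, far smaller than in the thread)
k_small <- 20    # number of groups (illustrative)

xs <- replicate(3, rnorm(m_small))                     # 200 x 3 numeric matrix
fs <- sample(1:k_small, size = m_small, replace = TRUE)

# Split each column separately, as in the quoted example
vs <- lapply(seq_len(ncol(xs)), function(i) split(xs[, i], fs))

# Bind the per-group pieces column-wise into matrices ...
as_mat <- mapply(function(a, b, c) cbind(V1 = a, V2 = b, V3 = c),
                 vs[[1]], vs[[2]], vs[[3]], SIMPLIFY = FALSE)

# ... or assemble a data.frame for every group
as_df <- mapply(function(a, b, c) data.frame(V1 = a, V2 = b, V3 = c),
                vs[[1]], vs[[2]], vs[[3]], SIMPLIFY = FALSE)

is.matrix(as_mat[[1]])      # the cbind() result is a plain matrix
is.data.frame(as_df[[1]])   # the data.frame() result carries extra attributes
```

Wrapping either call in system.time() at a larger size is one way to check which step dominates on your own machine.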
If you just want a list of matrices, use
> system.time(split.data.frame(x,f))
user system elapsed
0.524 0.016 0.927
>
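As a rough sketch of what that call returns when x is a matrix: the example below (small illustrative sizes; the names xs, fs, by_method and by_index are assumptions of mine) compares split.data.frame() with splitting the row indices by hand and subsetting the matrix with drop = FALSE.

```r
set.seed(0)
xs <- replicate(3, rnorm(200))                  # 200 x 3 numeric matrix
fs <- sample(1:20, size = 200, replace = TRUE)  # grouping variable

# Calling the data.frame method directly on a matrix gives a list of matrices
by_method <- split.data.frame(xs, fs)

# Hand-rolled version: split the row numbers, then subset the matrix.
# drop = FALSE keeps single-row groups as 1 x 3 matrices.
by_index <- lapply(split(seq_len(nrow(xs)), fs),
                   function(i) xs[i, , drop = FALSE])

is.matrix(by_method[[1]])
all.equal(by_method, by_index)   # compare the two results
```

Because no data.frame is ever built, the per-group subsetting goes through matrix indexing rather than "[.data.frame".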
>> Read up on memory management issues. Think about what the computer actually
>> has to do in terms of memory access to split a data.frame versus split a
>> vector.
>
> I'd like to read more on how R does memory management. Would you please
> point me to a good source?
I see now that the timing issue was not one of memory, but of doing more
work (see Rprof results below) to create a data.frame. But if you are
interested you might look at
Golub, Gene H.; Van Loan, Charles F. (1996), Matrix Computations (3rd
ed.), Johns Hopkins, ISBN 978-0-8018-5414-9 .
and/or Google "BLAS memory"
>
> But again, R is not user friendly. It took me quite a long time to
> figure out that splitting a data.frame is a bottleneck in my program
> and to reduce the problem to a test case.
See
?Rprof
and note where the 'self.time's are largest below (not in split or
split.data.frame):
> Rprof()
> res <- split(as.data.frame(x),f)
> Rprof(NULL)
> summaryRprof()
$by.self
                        self.time self.pct total.time total.pct
"attr"                      33.66     72.9      33.66      72.9
"[.data.frame"               3.26      7.1      45.70      98.9
"inherits"                   1.52      3.3       2.06       4.5
"anyDuplicated"              1.04      2.3       1.42       3.1
"[[.data.frame"              1.00      2.2       4.76      10.3
"[["                         0.74      1.6       5.50      11.9
"match"                      0.66      1.4       2.96       6.4
"<Anonymous>"                0.66      1.4       0.72       1.6
"sys.call"                   0.46      1.0       0.46       1.0
"all"                        0.38      0.8       0.38       0.8
"anyDuplicated.default"      0.36      0.8       0.38       0.8
"%in%"                       0.32      0.7       3.26       7.1
"names"                      0.26      0.6       0.26       0.6
"is.factor"                  0.24      0.5       2.30       5.0
"length"                     0.20      0.4       0.20       0.4
"attr<-"                     0.18      0.4       0.18       0.4
"as.character"               0.16      0.3       0.16       0.3
"["                          0.14      0.3      45.84      99.2
"-"                          0.14      0.3       0.14       0.3
"!"                          0.12      0.3       0.12       0.3
".Call"                      0.12      0.3       0.12       0.3
"!="                         0.10      0.2       0.10       0.2
"vector"                     0.06      0.1       0.26       0.6
"as.data.frame.matrix"       0.06      0.1       0.08       0.2
"|"                          0.06      0.1       0.06       0.1
"lapply"                     0.04      0.1      46.12      99.8
"<"                          0.04      0.1       0.04       0.1
"any"                        0.04      0.1       0.04       0.1
"is.na"                      0.04      0.1       0.04       0.1
".subset2"                   0.04      0.1       0.04       0.1
">"                          0.02      0.0       0.02       0.0
"as.vector"                  0.02      0.0       0.02       0.0
"dim"                        0.02      0.0       0.02       0.0
"is.matrix"                  0.02      0.0       0.02       0.0
"unique.default"             0.02      0.0       0.02       0.0
"split"                      0.00      0.0      46.20     100.0
"split.data.frame"           0.00      0.0      46.12      99.8
"FUN"                        0.00      0.0      45.84      99.2
"factor"                     0.00      0.0       0.24       0.5
"is.vector"                  0.00      0.0       0.24       0.5
"split.default"              0.00      0.0       0.24       0.5
"as.data.frame"              0.00      0.0       0.08       0.2
"unique"                     0.00      0.0       0.02       0.0
[output truncated]
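For anyone who wants to repeat this kind of profiling on a smaller case, here is a minimal sketch. The sizes and the names xs, fs and prof_file are illustrative assumptions, and the tryCatch() guard is my own addition: Rprof() samples the call stack at intervals (0.02 seconds by default), so a very short run may record no samples, in which case summaryRprof() can fail.

```r
set.seed(0)
xs <- replicate(6, rnorm(20000))                    # 20000 x 6 numeric matrix
fs <- sample(1:2000, size = 20000, replace = TRUE)  # about 2000 groups

prof_file <- tempfile(fileext = ".out")  # hypothetical file for the samples

Rprof(prof_file)                         # start sampling the call stack
res <- split(as.data.frame(xs), fs)      # the operation under study
Rprof(NULL)                              # stop profiling

# Guard against an empty profile from a run that finished too quickly
prof <- tryCatch(summaryRprof(prof_file), error = function(e) NULL)
if (!is.null(prof)) {
  print(head(prof$by.self, 5))           # functions with the largest self.time
}
```

The by.self table is the one to read first: it attributes time to the function actually executing, whereas by.total also charges callers such as split and lapply for the work done beneath them.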
Chuck
> I don't know how memory
> management is done in R so that I don't know if it is possible to fix
> the problem for splitting a data.frame without perturbing the
> interface of data.frame. But if the speed of splitting data.frame is
> so slow, maybe it can be forbidden and an alternative can be
> documented somewhere.
>
>> ---
>>
>> And even if it were simply a matter of having code that is slow for some
>> application, that would not be a bug. Read the FAQ!
>
> The definition of a bug in the FAQ is narrower than what I thought.
> No matter what the definition of a bug is, split() on a data.frame is
> a perfectly legitimate operation (in terms of an interface). A quick fix
> to this problem is to at least single out the case where the argument
> is a data.frame, and to do what I have been doing below. That is why
> I say this is a performance bug. Similar cases, where a faster
> alternative exists but is not used, are rightly called bugs, at least
> in many other languages.
>
>> m=300000
>> n=6
>> k=30000
>>
>> set.seed(0)
>> x=replicate(n,rnorm(m))
>> f=sample(1:k, size=m, replace=T)
>>
>> system.time(split(as.data.frame(x),f))
> user system elapsed
> 39.020 0.010 39.084
>>
>> v=lapply(
> + 1:dim(x)[[2]]
> + , function(i) {
> + split(x[,i],f)
> + }
> + )
>>
>> system.time(lapply(
> + 1:dim(x)[[2]]
> + , function(i) {
> + split(x[,i],f)
> + }
> + )
> + )
> user system elapsed
> 2.520 0.000 2.526
>>
>> system.time(
> + mapply(
> + function(...) {
> + cbind(...)
> + }
> + , v[[1]], v[[2]], v[[3]], v[[4]], v[[5]], v[[6]]
> + )
> + )
> user system elapsed
> 0.920 0.000 0.927
>>
>
>
>>>>>
>>>>>> David.
>>>>>>>
>>>>>>>> system.time(split(as.data.frame(x),f))
>>>>>>>
>>>>>>> user system elapsed
>>>>>>> 1.700 0.010 1.786
>>>>>>>>
>>>>>>>> system.time(lapply(
>>>>>>>
>>>>>>> + 1:dim(x)[[2]]
>>>>>>> + , function(i) {
>>>>>>> + split(x[,i],f)
>>>>>>> + }
>>>>>>> + )
>>>>>>> + )
>>>>>>> user system elapsed
>>>>>>> 0.170 0.000 0.167
>>>>>>>
>>>>>>> ###########
>>>>>>> m=30000
>>>>>>> n=6
>>>>>>> k=3000
>>>>>>>
>>>>>>> set.seed(0)
>>>>>>> x=replicate(n,rnorm(m))
>>>>>>> f=sample(1:k, size=m, replace=T)
>>>>>>>
>>>>>>> system.time(split(as.data.frame(x),f))
>>>>>>>
>>>>>>> system.time(lapply(
>>>>>>> 1:dim(x)[[2]]
>>>>>>> , function(i) {
>>>>>>> split(x[,i],f)
>>>>>>> }
>>>>>>> )
>>>>>>> )
>>>>>>>
>>>>>>> ______________________________________________
>>>>>>> R-help at r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>> PLEASE do read the posting guide
>>>>>>> http://www.R-project.org/posting-guide.html
>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>
>>>>>> David Winsemius, MD
>>>>>> Heritage Laboratories
>>>>>> West Hartford, CT
>>>>>>
>>>>>>
>>
>> Charles C. Berry (858) 534-2098
>> Dept of Family/Preventive
>> Medicine
>> E mailto:cberry at tajo.ucsd.edu UC San Diego
>> http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901
>>
>>
Charles C. Berry (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901