[R] Significant performance difference between split of a data.frame and split of vectors
Matthew Dowle
mdowle at mdowle.plus.com
Fri Dec 18 16:15:33 CET 2009
Thanks for suggesting data.table. It does have advantages in this example
but it has to be used in a particular way.
What does Peng actually want to achieve? I'll guess (but it's only a guess)
that he doesn't actually need to hold the entire table in memory in a split
up format before doing something with each subset. He only needs each
subset at a time. Most of the time I guess most users are similar in this
regard.
data.table has a mechanism to do this, with a compact syntax that is
quicker to program. When you use the 'by' argument, it evaluates the j
expression on the first subset defined by 'by', after which that subset is
no longer needed. It then moves on to the next subset. So it operates in a
smaller memory footprint than i) creating an entire copy of the data in a
split up format, and then ii) lapply'ing through that new list doing
something on the subsets.
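To make the "one subset at a time" idea concrete, here is a minimal base R
sketch. It only illustrates the memory pattern and is not how data.table is
implemented internally. The names rows_by_group, subset_g and group_sums are
made up for this illustration.

```r
set.seed(0)
m <- 30000; n <- 6; k <- 3000
x <- replicate(n, rnorm(m))
f <- sample(1:k, size = m, replace = TRUE)

# Split only the row numbers. This is cheap compared with splitting the data.
rows_by_group <- split(seq_len(m), f)

# Visit one group at a time: copy its rows, compute, then let the copy go.
group_sums <- vapply(
  rows_by_group,
  function(rows) {
    subset_g <- x[rows, , drop = FALSE]  # only this group's rows are copied
    sum(subset_g[, 1])                   # the "j expression" for this group
  },
  numeric(1)
)
```

At any moment only one group's rows exist as a separate copy, rather than a
second full copy of x held in a list.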
By guessing at something realistic we might want to actually do on each
subset, here is the example again :
> m = 30000
> n = 6
> k = 3000
> x = replicate(n,rnorm(m))
> f = sample(1:k,size=m,replace=TRUE)
> dt = data.table(x,f)
> dt[,sum(V1),by="f"]   # this is the proper mechanism to split a
>                       # data.table and operate on each subset. Column
>                       # names can be used as variables directly.
f V1
1 1.82720825112107
2 4.22721189592209
3 -0.409096014477913
...[ snip ] ...
> system.time(dt[,sum(V1),by="f"]) # same again but timing it this time
user system elapsed
0.13 0.00 0.12
>
> system.time(split(as.data.frame(x),f))
user system elapsed
1.55 0.00 1.55
>
So just splitting a data.frame is more than 10 times slower here (1.55s
against 0.12s elapsed) than splitting the data.table in the proper way and
summing each subset. Even then the data.frame split still needs more
program code to loop through the split up copy afterwards and do
something useful with it.
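For comparison, here is a sketch of both routes producing the same per-group
sums. It assumes a current data.table release. The names grp, total, pieces,
sums_df and sums_dt are made up for this illustration, and timings will vary
by machine and version.

```r
library(data.table)

set.seed(0)
m <- 30000; n <- 6; k <- 3000
x <- replicate(n, rnorm(m))
f <- sample(1:k, size = m, replace = TRUE)

# data.frame route: copy the whole table into a list of pieces, then loop.
df <- as.data.frame(x)                      # columns V1..V6
pieces <- split(df, f)
sums_df <- vapply(pieces, function(d) sum(d$V1), numeric(1))

# data.table route: one expression, evaluated group by group.
dt <- as.data.table(df)
dt[, grp := f]                              # add the grouping column
sums_dt <- dt[, list(total = sum(V1)), by = grp]

# 'by' returns groups in order of first appearance, so sort before comparing.
sums_dt <- sums_dt[order(grp)]
all.equal(unname(sums_df), sums_dt$total)
```

The data.frame route needs the extra vapply() step over the split up list,
which is the additional program code referred to above.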
It's not just speed: data.table may work where data.frame fails. An
example could be constructed where x takes 55% of available RAM. I'd expect
that splitting a data.frame should fail with an out of memory error (as it
needs another 55% for the full copy). A data.table 'by' in contrast should
work fine since it only requires the memory for the largest subset at any
one time (other than in the esoteric case of there being only one subset).
Having said the above I still consider 'by' in data.table to be slow, but
just relative to how fast it could be. See feature request #195 in the
package's R-forge project. There are some other more complicated
circumstances when the 'by' argument can result in slow performance
(possibly relative to data.frame or matrix methods) but this example doesn't
seem to be in that category, at least with what we've seen so far. The
problem in this thread so far looks to be due to split() itself.
Please note there is no split.data.table method and the default method does
not appear efficient. methods(split) shows there is no split method for
data.table but data.frame has its own special split method. The following
method has been added as a feature request in the data.table project :
split.data.table = function(...){
    stop("Use 'by' argument instead of split() on data.table. See ?'[.data.table'")
}
Hope this helps,
Matthew
"David Winsemius" <dwinsemius at comcast.net> wrote in message
news:CE3E3AFF-F2B5-4C80-9D00-00954130AA6B at comcast.net...
>
> On Dec 9, 2009, at 2:59 PM, Peng Yu wrote:
>
>> On Tue, Dec 8, 2009 at 10:37 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>>>
>>> On Dec 8, 2009, at 11:28 PM, Peng Yu wrote:
>>>
>>>> I have the following code, which tests the split on a data.frame and
>>>> the split on each column (as vector) separately. The runtimes differ
>>>> by a factor of 10. When m and k increase, the difference becomes even
>>>> bigger.
>>>>
>>>> I'm wondering why the performance on data.frame is so bad. Is it a bug
>>>> in R? Can it be improved?
>>>
>>> You might want to look at the data.table package. The author claims
>>> significant speed improvements over data.frames.
>>
>> 'data.table' doesn't seem to help. You can try the other set of m,n,k.
>> In both cases, using as.data.frame is faster than using as.data.table.
>>
>> Please let me know if I understand what you meant.
>
> I was only suggesting that you look at it because it appeared in other
> situations to have efficiency advantages. As it turned out, that structure
> offered no advantage when I tested it.
>
> --
> David.
>
>
>>
>>> m=10
>>> n=6
>>> k=3
>>>
>>> #m=300000
>>> #n=6
>>> #k=30000
>>>
>>> set.seed(0)
>>> x=replicate(n,rnorm(m))
>>> f=sample(1:k, size=m, replace=T)
>>>
>>> library(data.table)
>> Loading required package: ref
>> dim(refdata) and dimnames(refdata) no longer allow parameter ref=TRUE,
>> use dim(derefdata(refdata)), dimnames(derefdata(refdata)) instead
>>> system.time(split(as.data.frame(x),f))
>> user system elapsed
>> 0.000 0.000 0.003
>>> system.time(split(as.data.table(x),f))
>> user system elapsed
>> 0.010 0.000 0.011
>>
>>>>> system.time(split(as.data.frame(x),f))
>>>>
>>>> user system elapsed
>>>> 1.700 0.010 1.786
>>>>>
>>>>> system.time(lapply(
>>>>
>>>> + 1:dim(x)[[2]]
>>>> + , function(i) {
>>>> + split(x[,i],f)
>>>> + }
>>>> + )
>>>> + )
>>>> user system elapsed
>>>> 0.170 0.000 0.167
>>>>
>>>> ###########
>>>> m=30000
>>>> n=6
>>>> k=3000
>>>>
>>>> set.seed(0)
>>>> x=replicate(n,rnorm(m))
>>>> f=sample(1:k, size=m, replace=T)
>>>>
>>>> system.time(split(as.data.frame(x),f))
>>>>
>>>> system.time(lapply(
>>>> 1:dim(x)[[2]]
>>>> , function(i) {
>>>> split(x[,i],f)
>>>> }
>>>> )
>>>> )
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>> David Winsemius, MD
>>> Heritage Laboratories
>>> West Hartford, CT
>>>
>>>
>>
>
> David Winsemius, MD
> Heritage Laboratories
> West Hartford, CT
>