[R] how to efficiently compute set unique?
G FANG
fanggangsw at gmail.com
Tue Jun 22 07:56:15 CEST 2010
Hi All,
I think I have figured out the problem. I have been a MATLAB user,
so throughout my code I kept everything in as.matrix format, and
unique() is much slower on a matrix than on a plain vector.
Once I dropped the as.matrix conversion, unique() takes just a few
seconds, and my other computations are fast as well.
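For future readers, here is a minimal sketch of the difference, assuming a
k-by-1 integer matrix like the one in this thread. The names x_vec and x_mat
and the size k are made up for illustration; this is not the original data.
unique() dispatches on its argument: given a matrix it returns the unique
rows, still as a matrix, while given a plain vector it returns the unique
elements. How large the timing gap is depends on the R version, so no
timings are shown here.

```r
set.seed(1)
k <- 1e6  # much smaller than the 13.5 million rows in the thread, to keep this quick

x_vec <- sample(100000, size = k, replace = TRUE)  # plain vector
x_mat <- as.matrix(x_vec)                          # k-by-1 matrix, MATLAB style

# The matrix method looks for unique rows; the default method for unique elements.
t_mat <- system.time(u_mat <- unique(x_mat))
t_vec <- system.time(u_vec <- unique(x_vec))

# Removing the dim attribute first makes unique() use the vector method.
u_drop <- unique(as.vector(x_mat))

# Elapsed seconds for each call; the values vary by machine and R version.
print(c(matrix = t_mat[["elapsed"]], vector = t_vec[["elapsed"]]))

# Same values either way, but the matrix result is still a one-column matrix.
print(dim(u_mat))
print(length(u_vec))
```

If the data already arrives as a one-column matrix, as.vector() or drop()
removes the dim attribute so that the vector method is used.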
Thanks a lot, Duncan, Steve, David, and Douglas.
Hopefully this case will also help future MATLAB->R users who get
stuck in the MATLAB way of thinking.
Gang
On Mon, Jun 21, 2010 at 7:01 PM, Douglas Bates <bates at stat.wisc.edu> wrote:
> On Mon, Jun 21, 2010 at 8:38 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>>
>> On Jun 21, 2010, at 9:18 PM, Duncan Murdoch wrote:
>>
>>> On 21/06/2010 9:06 PM, G FANG wrote:
>>>>
>>>> Hi,
>>>>
>>>> I want to get the unique values of a large numeric k-by-1 vector, where
>>>> k is in the tens of millions.
>>>>
>>>> With the MATLAB function unique, this takes less than 10 seconds,
>>>>
>>>> but when I tried unique in R with a similar CPU and memory, it had
>>>> not finished after several minutes.
>>>>
>>>> Am I using the function in the right way?
>>>>
>>>> dim(cntxtn)
>>>> [1] 13584763 1
>>>> uniqueCntxt = unique(cntxtn); # this is taking really long
>>>
>>> What type is cntxtn? If I do that sort of thing on a numeric vector, it's
>>> quite fast:
>>>
>>> > x <- sample(100000, size=13584763, replace=T)
>>> > system.time(unique(x))
>>>    user  system elapsed
>>>    3.61    0.14    3.75
>>
>> If it's a factor, it could be as simple as:
>>
>> levels(cntxtn) # since the work of "unique-ification" has already been
>> done.
>
> Not quite. When you generate a factor, as you do in your example, the
> levels correspond to the unique values of the original vector. But
> when you take a subset of a factor the levels are preserved intact,
> even if some of those levels do not occur in the subset. This is why
> there are unusual arguments with names like drop.unused.levels in
> functions like model.frame. It is also a subtle difference in the
> behavior of factor(x) and as.factor(x) when x is already a factor.
>
> > ff <- factor(sample.int(200, 1000, replace = TRUE))
> > ff1 <- ff[1:40]
> > length(levels(ff))
> [1] 199
> > length(levels(ff1))
> [1] 199
> > length(levels(as.factor(ff1)))
> [1] 199
> > length(levels(factor(ff1)))
> [1] 34
>
>> > x <- factor(sample(100000, size=13584763, replace=T))
>> > system.time(levels(x))
>>    user  system elapsed
>>       0       0       0
>> > system.time(y <- levels(x))
>>    user  system elapsed
>>       0       0       0
>
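A small sketch of Douglas's point, assuming a made-up factor ff like the one
in his example (the seed and the variable names are only illustrative).
levels() reports every level the factor was created with, so after
subsetting it can list values that no longer occur. Either factor() or
droplevels() rebuilds the levels from the values actually present, and that
count then agrees with length(unique()).

```r
set.seed(42)
ff  <- factor(sample.int(200, 1000, replace = TRUE))
ff1 <- ff[1:40]                        # subsetting keeps all original levels

n_all    <- nlevels(ff1)               # levels carried over from ff
n_used   <- nlevels(droplevels(ff1))   # only levels that occur in ff1
n_refac  <- nlevels(factor(ff1))       # re-creating the factor has the same effect
n_unique <- length(unique(ff1))        # distinct values actually present

print(c(all = n_all, used = n_used, refactored = n_refac, unique = n_unique))
```

So levels(x) is a safe shortcut for unique(x) only when the factor is known
to have no unused levels.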