[R] duplicated() with long vectors

Wed Dec 5 23:10:53 CET 2012

> What I was trying to do was get a vector saying, for each item,
> whether that item is the same as the preceding item. Now that I think
> of it, I could do this easily by copying the vector, shifting it over
> one (by removing the first element and adding something to the end),
> and then just comparing the elements of the two vectors directly.

Right. Did you look at rle() yet?

Though for your particular simple case,

> system.time(verylong[1:(n-1)] == verylong[2:n])
   user  system elapsed
  0.001   0.000   0.002

is nearly instantaneous.
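
To make the difference concrete, here is a minimal sketch on a made-up toy vector (not your data; the name verylong is reused only for illustration). It marks the first element as FALSE because it has no predecessor, which is my assumption and not something you specified; NA would be just as reasonable. It also shows one way to get the same answer from rle(), and how duplicated() differs.

```r
# Toy stand-in for the real data; these values are invented for illustration.
verylong <- c(3, 3, 5, 5, 5, 3, 7, 7)
n <- length(verylong)

# Shifted comparison: element i versus element i - 1.
# Negative indexing drops the first / last element and also behaves when n is 1.
# The leading FALSE is an assumption: the first element has no predecessor.
same_as_prev <- c(FALSE, verylong[-1] == verylong[-n])

# Same result via rle(): within each run, every element after the first
# repeats its predecessor.
runs <- rle(verylong)
same_as_prev_rle <- sequence(runs$lengths) > 1

# duplicated() flags a value seen anywhere earlier, not just directly before.
dup <- duplicated(verylong)

print(data.frame(verylong, same_as_prev, same_as_prev_rle, dup))
```

The two "same as previous" columns agree, while duplicated() also flags the sixth element (a 3 that follows a 5) because 3 appeared earlier in the vector.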

On Wed, Dec 5, 2012 at 5:04 PM, Stephen Politzer-Ahles
<politzerahless at gmail.com> wrote:
> Hi Sarah,
>
> Thanks a lot for your explanation. I was mistakenly under the
> impression that duplicated() only looked at the immediately preceding
> element, not all preceding elements.
>
> What I was trying to do was get a vector saying, for each item,
> whether that item is the same as the preceding item. Now that I think
> of it, I could do this easily by copying the vector, shifting it over
> one (by removing the first element and adding something to the end),
> and then just comparing the elements of the two vectors directly.
>
> Best,
> Steve
>
> On Wed, Dec 5, 2012 at 3:08 PM, Sarah Goslee <sarah.goslee at gmail.com> wrote:
>> Hi,
>>
>> duplicated() doesn't just look at consecutive values, but anywhere in
>> the object. Your 12320-element vector has only 48 distinct values,
>> and all of them occur before the last 30 elements, so duplicated()
>> returns TRUE for every one of those last 30.
>>
>> You might be looking for something involving rle(). What are you
>> trying to accomplish?
>>
>> Sarah
>>
>> On Wed, Dec 5, 2012 at 3:53 PM, Stephen Politzer-Ahles
>> <politzerahless at gmail.com> wrote:
>>> Hello,
>>>
>>> duplicated() does not seem to work for a long vector. For example, if
>>> you download the data from
>>> https://docs.google.com/open?id=0B6-m45Jvl3ZmNmpaSlJWMXo5bmc (a vector
>>> with about 12,000 numbers) and then run the following code which does
>>> duplicated() over the whole vector but just shows the last 30
>>> elements:
>>>
>>> data.frame( tail(verylong, 30), tail(duplicated(verylong), 30) )
>>>
>>> you'll see that at the end of the very long vector everything is
>>> listed as a duplicate of the preceding element (even though it
>>> shouldn't be). On the other hand, if you run the following code which
>>> just takes out the last 30 elements of the vector and does duplicated
>>> on them:
>>>
>>> data.frame( tail(verylong, 30), duplicated(tail(verylong, 30)) )
>>>
>>> you get the correct results (FALSE shows up wherever the value in the
>>> first column changes). Does anyone know why this happens, and if
>>> there's a fix? I notice the documentation for duplicated() says: "Long
>>> vectors are supported for the default method of duplicated, but may
>>> only be usable if nmax is supplied."  But I've tried running this with
>>> a high value of nmax given, and it still gives me the same problem.
>>>
>>> So far the only way I've figured out to get this duplicated()-like
>>> vector is to use a for loop going through one item at a time, but that
>>> takes about a minute to run.
>>>
>>> Best,
>>> Steve Politzer-Ahles
>>>
>>
>>

--
Sarah Goslee
http://www.functionaldiversity.org