[R] duplicated() with long vectors

Wed Dec 5 23:24:42 CET 2012

Sorry, that's my mistake: I should not have said 'long vector'. Mine
is just an ordinary vector of about 12,000 numbers, and I'm not using
a development version of R.

Best,
Steve
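
Note for readers of the archive: below is a minimal sketch of what
seems to be going on, assuming the goal in the original post was to
flag elements that equal their immediate predecessor. duplicated()
reports whether a value has occurred anywhere earlier in the vector,
so its result for the last 30 elements depends on everything that
comes before them. The vector x and the names dup_anywhere and
same_as_previous are made up for illustration and are not taken from
the original data.

```r
# Hypothetical stand-in for the 'verylong' vector from the original post.
x <- c(1, 1, 2, 2, 1, 1, 3)

# duplicated() is TRUE when the same value occurred ANYWHERE earlier in x.
dup_anywhere <- duplicated(x)
# FALSE TRUE FALSE TRUE TRUE TRUE FALSE

# TRUE only where an element equals the one immediately before it.
# Vectorised, so no for loop is needed; assumes length(x) >= 1.
same_as_previous <- c(FALSE, x[-1] == x[-length(x)])
# FALSE TRUE FALSE TRUE FALSE TRUE FALSE

# Side-by-side view, in the style of the data.frame() calls in the post.
data.frame(x, dup_anywhere, same_as_previous)
```

The two columns differ at the fifth element: the value 1 has been seen
before, so duplicated() says TRUE, but it differs from the 2 just
before it. This would also explain why duplicated(tail(verylong, 30))
looked right: within only the last 30 elements, fewer values had
already been seen.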

On Wed, Dec 5, 2012 at 4:22 PM, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:
>
>
> And BTW, 'long vector' is a technical term in R: not 12,000, but more than 2
> billion elements.  You will hear it a lot more in the run-up to the next
> 'minor' release of R (currently R-devel, maybe 2.16.0-to-be, which is the
> only version from which that quote comes that I am aware of).
>
> The posting guide asked for 'at a minimum' information: if you are using an
> unreleased development version of R you really must tell us (and should not
> be reporting to the R-help list).
>
>
>>
>> Sarah
>>
>> On Wed, Dec 5, 2012 at 3:53 PM, Stephen Politzer-Ahles
>> <politzerahless at gmail.com> wrote:
>>>
>>> Hello,
>>>
>>> duplicated() does not seem to work for a long vector. For example, if
>>> you download the data from
>>> https://docs.google.com/open?id=0B6-m45Jvl3ZmNmpaSlJWMXo5bmc (a vector
>>> with about 12,000 numbers) and then run the following code which does
>>> duplicated() over the whole vector but just shows the last 30
>>> elements:
>>>
>>> data.frame( tail(verylong, 30), tail(duplicated(verylong), 30) )
>>>
>>> you'll see that at the end of the very long vector everything is
>>> listed as a duplicate of the preceding element (even though it
>>> shouldn't be). On the other hand, if you run the following code which
>>> just takes out the last 30 elements of the vector and does duplicated
>>> on them:
>>>
>>> data.frame( tail(verylong, 30), duplicated(tail(verylong, 30)) )
>>>
>>> you get the correct results (FALSE shows up wherever the value in the
>>> first column changes). Does anyone know why this happens, and if
>>> there's a fix? I notice the documentation for duplicated() says: "Long
>>> vectors are supported for the default method of duplicated, but may
>>> only be usable if nmax is supplied."  But I've tried running this with
>>> a high value of nmax given, and it still gives me the same problem.
>>>
>>> So far the only way I've figured out to get this duplicated()-like
>>> vector is to use a for loop going through one item at a time, but that
>>> takes about a minute to run.
>>>
>>> Best,
>>> Steve Politzer-Ahles
>
>
>
> --
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595

-- 
Stephen Politzer-Ahles
University of Kansas
Linguistics Department
http://people.ku.edu/~sjpa/