[Rd] duplicated() variation that goes both ways to capture all duplicates

Liviu Andronic landronimirc at gmail.com
Mon Jul 23 14:49:02 CEST 2012


Dear all,
The trouble with the current duplicated() function in R is that it can
report duplicates while searching fromFirst _or_ fromLast, but not
both ways. Often users will want to identify and extract all the
copies of an item that has duplicates, not only the duplicates
themselves.

To take the example from the man page:
> data(iris)
> iris[duplicated(iris), ]  ##duplicates while searching "fromFirst"
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
143          5.8         2.7          5.1         1.9 virginica
> iris[duplicated(iris, fromLast=TRUE), ]  ##duplicates while searching "fromLast"
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
102          5.8         2.7          5.1         1.9 virginica
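
The two search directions are easier to see on a short vector. The
following is only a sketch on a made-up vector 'x' (not part of the
iris example), showing which copies each direction flags:

```r
## hypothetical toy vector, used only for illustration
x <- c("a", "b", "a", "c", "a")

## default direction: every copy after the first "a" is flagged
fwd <- duplicated(x)
print(fwd)   # FALSE FALSE  TRUE FALSE  TRUE

## reverse direction: every copy before the last "a" is flagged
bwd <- duplicated(x, fromLast = TRUE)
print(bwd)   #  TRUE FALSE  TRUE FALSE FALSE
```

Neither result on its own marks all three copies of "a".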


To extract all the copies of the affected rows (the "original" and
its duplicates), one needs to combine both searches, like this:
> iris[(duplicated(iris) | duplicated(iris, fromLast=TRUE)), ]  ##duplicates while searching "bothWays"
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
102          5.8         2.7          5.1         1.9 virginica
143          5.8         2.7          5.1         1.9 virginica
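
For plain atomic vectors, what appears to be the same mask can also be
built with %in%. This is a sketch on a made-up vector 'x', offered as
an assumption-laden alternative rather than a replacement: it is meant
for vectors only, and I would not rely on it for the rows of a data
frame such as iris.

```r
## hypothetical toy vector, used only for illustration
x <- c("a", "b", "a", "c", "a")

## mask built from the two passes, as in the iris example
both <- duplicated(x) | duplicated(x, fromLast = TRUE)

## vector-only shortcut: is each element among the values that repeat?
shortcut <- x %in% x[duplicated(x)]

## compare the two masks and extract all copies of the repeated value
same <- identical(both, shortcut)
print(same)
print(x[both])
```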


Unfortunately this is unnecessarily long and convoluted. Short of a
'bothWays' argument in duplicated(), I came up with a small wrapper
that simplifies the above:
duplicated2 <-
    function(x, bothWays=TRUE, ...)
    {
        ## forward pass: flags every copy after the first occurrence
        dup <- duplicated(x, ...)
        ## backward pass: also flags every copy before the last occurrence
        if(bothWays) dup <- dup | duplicated(x, fromLast=TRUE, ...)
        dup
    }


Now the above can be achieved simply via:
> iris[duplicated2(iris), ]  ##duplicates while searching "bothWays"
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
102          5.8         2.7          5.1         1.9 virginica
143          5.8         2.7          5.1         1.9 virginica


So here's my inquiry: Would R Core consider adding such
functionality to 'base' R? Either the duplicated2() function above
(suitably cleaned up), or a "bothWays" argument in
duplicated() itself? Either of the two would improve user convenience
and reduce confusion. (In my case it took some time before I
understood the correct approach to this problem.)

Regards
Liviu


-- 
Do you know how to read?
http://www.alienetworks.com/srtest.cfm
http://goodies.xfce.org/projects/applications/xfce4-dict#speed-reader
Do you know how to write?
http://garbl.home.comcast.net/~garbl/stylemanual/e.htm#e-mail


