[Rd] duplicated.data.frame {was "[R] which rows are duplicates?"}

Martin Maechler maechler at stat.math.ethz.ch
Tue Mar 31 15:20:28 CEST 2009


>>>>> "WK" == Wacek Kusnierczyk <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no>
>>>>>     on Mon, 30 Mar 2009 14:26:24 +0200 writes:

    WK> Michael Dewey wrote:
    >> At 05:07 30/03/2009, Aaron M. Swoboda wrote:
    >>> I would like to know which rows are duplicates of each other, not
    >>> simply that a row is duplicate of another row. In the following
    >>> example rows 1 and 3 are duplicates.
    >>> 
    >>> > x <- c(1,3,1)
    >>> > y <- c(2,4,2)
    >>> > z <- c(3,4,3)
    >>> > data <- data.frame(x,y,z)
    >>> > data
    >>>   x y z
    >>> 1 1 2 3
    >>> 2 3 4 4
    >>> 3 1 2 3
    >> 
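
For the original question (which rows are duplicates of *each other*), a minimal sketch follows. It is not one of the solutions posted in the thread, and the names `key`, `groups` and `dups` are illustrative only. It builds the same kind of pasted row key that is discussed further down, so it also treats NAs in the same position as equal.

```r
# Example data from the original question.
data <- data.frame(x = c(1, 3, 1), y = c(2, 4, 2), z = c(3, 4, 3))

# One string key per row; "\r" is used as a separator unlikely to occur in data.
key <- do.call("paste", c(data, sep = "\r"))

# Group row indices by key, then keep only groups with more than one row.
groups <- split(seq_len(nrow(data)), key)
dups <- unname(Filter(function(i) length(i) > 1, groups))

dups
# [[1]]
# [1] 1 3
```

Each element of `dups` is a set of row numbers that are mutual duplicates; here rows 1 and 3.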

    WK> i don't have any solution significantly better than what you have
    WK> already been given.  but i have a warning instead.

    WK> in the below, you use both 'duplicated' and 'unique' on data frames, and
    WK> the proposed solution relies on the latter.  you may want to try to
    WK> avoid both when working with data frames;  this is because of how they
    WK> do (or don't) work.

    WK> duplicated (and unique, which calls duplicated) simply pastes the
    WK> content of each row into a *string*, and then works on the strings. 
    WK> this means that NAs in the data frame are converted to "NA"s, and "NA"
    WK> == "NA", obviously, so that rows that include NAs and are otherwise
    WK> identical will be considered *identical*.
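
To make the point about NAs concrete, here is a small sketch that builds the pasted row keys by hand, the way the source code quoted further down does. The data frame `df` is a made-up example, not data from the thread.

```r
# Rows 1 and 2 agree only if NA is considered equal to NA.
df <- data.frame(x = c(1, 1, 2), y = c(NA, NA, 3))

# paste() turns NA into the string "NA", so rows 1 and 2 get the same key.
keys <- do.call("paste", c(df, sep = "\r"))
keys[1] == keys[2]
# [1] TRUE

duplicated(keys)
# [1] FALSE  TRUE FALSE

# The data frame method reports the same rows as duplicated.
duplicated(df)
# [1] FALSE  TRUE FALSE
```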

    WK> that's not bad (yet), but you should be aware.  however, duplicated has
    WK> a parameter named 'incomparables', explained in ?duplicated as follows:

    WK> "
    WK> incomparables: a vector of values that cannot be compared. 'FALSE' is a
    WK> special value, meaning that all values can be compared, and
    WK> may be the only value accepted for methods other than the
    WK> default.  It will be coerced internally to the same type as
    WK> 'x'.
    WK> "

    WK> and also

    WK> "
    WK> Values in 'incomparables' will never be marked as duplicated. This
    WK> is intended to be used for a fairly small set of values and will
    WK> not be efficient for a very large set.
    WK> "

    WK> that is, for example:

    WK> vector = c(NA, NA)
    WK> duplicated(vector)
    WK> # [1] FALSE TRUE
    WK> duplicated(vector, incomparables=NA)
    WK> # [1] FALSE FALSE

    WK> list = list(NA, NA)
    WK> duplicated(list)
    WK> # [1] FALSE TRUE
    WK> duplicated(list, incomparables=NA)
    WK> # [1] FALSE FALSE


    WK> what the documentation *fails* to tell you is that the parameter
    WK> 'incomparables' is defunct

No, not "defunct", but the contrary of it,
"not yet implemented" !

    WK> in duplicated.data.frame, which you can see in its
    WK> source code (below), or in the following example:

    WK> # data as above, or any data frame
    WK> duplicated(data, incomparables=NA)
    WK> # Error in if (!is.logical(incomparables) || incomparables)
    WK> #     .NotYetUsed("incomparables != FALSE") :
    WK> #   missing value where TRUE/FALSE needed

    WK> the error message here is *confusing*.  
yes!

    WK> the error is raised because the
    WK> author of the code made a mistake and apparently haven't carefully
((plural or singular ??))

    WK> examined and tested his product;  the code goes:
((aah, ... "singular" ...))

    WK> duplicated.data.frame
    WK> # function (x, incomparables = FALSE, fromLast = FALSE, ...)
    WK> # {
    WK> #    if (!is.logical(incomparables) || incomparables)
    WK> #        .NotYetUsed("incomparables != FALSE")
    WK> #    duplicated(do.call("paste", c(x, sep = "\r")), fromLast = fromLast)
    WK> # }
    WK> # <environment: namespace:base>

    WK> clearly, the intention here is to raise an error with a (still hardly
    WK> clear) message as in:

    WK> .NotYetUsed("incomparables != FALSE")
    WK> # Error: argument 'incomparables != FALSE' is not used (yet)

    WK> but instead, if(NA) is evaluated (because '!is.logical(NA) || NA'
    WK> evaluates, *obviously*, to NA) and hence the uninformative error message.
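
The evaluation described above can be checked in isolation. This is only a sketch of the condition, using a local variable `incomparables` and a `tryCatch()` wrapper (both added here for illustration) so that the error is captured rather than raised.

```r
incomparables <- NA

# is.logical(NA) is TRUE, so the left operand is FALSE,
# and FALSE || NA evaluates to NA.
cond <- !is.logical(incomparables) || incomparables
cond
# [1] NA

# if (NA) is an error; capture its message instead of stopping.
msg <- tryCatch(
    if (cond) "branch taken" else "branch not taken",
    error = function(e) conditionMessage(e)
)
msg   # the "missing value where TRUE/FALSE needed" message
```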

    WK> take home point:  rtfm, *but* don't believe it.

and then be helpful to the R community and send a bug report
*with* a patch if {as in this case} you are able to...

Well, that's no longer needed here,
I'll fix that easily myself.
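
(A sketch of one possible guard, not necessarily the change that will be committed: testing with identical() always yields a single TRUE or FALSE, so an NA can no longer reach if(). The function name `check_incomparables` is hypothetical, and stop() stands in for .NotYetUsed() here.)

```r
check_incomparables <- function(incomparables = FALSE) {
    # identical() never returns NA, so the condition is always usable in if().
    if (!identical(incomparables, FALSE))
        stop("argument 'incomparables != FALSE' is not used (yet)",
             call. = FALSE)
    invisible(TRUE)
}

check_incomparables(FALSE)   # passes silently

# With NA the intended message is produced instead of the confusing one.
res <- tryCatch(check_incomparables(NA),
                error = function(e) conditionMessage(e))
res
# [1] "argument 'incomparables != FALSE' is not used (yet)"
```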

Martin

    WK> vQ


