[Rd] duplicated.data.frame {was "[R] which rows are duplicates?"}
Martin Maechler
maechler at stat.math.ethz.ch
Tue Mar 31 15:20:28 CEST 2009
>>>>> "WK" == Wacek Kusnierczyk <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no>
>>>>> on Mon, 30 Mar 2009 14:26:24 +0200 writes:
WK> Michael Dewey wrote:
>> At 05:07 30/03/2009, Aaron M. Swoboda wrote:
>>> I would like to know which rows are duplicates of each other, not
>>> simply that a row is duplicate of another row. In the following
>>> example rows 1 and 3 are duplicates.
>>>
>>> > x <- c(1,3,1)
>>> > y <- c(2,4,2)
>>> > z <- c(3,4,3)
>>> > data <- data.frame(x,y,z)
>>>   x y z
>>> 1 1 2 3
>>> 2 3 4 4
>>> 3 1 2 3
>>
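For the question as asked, here is a minimal sketch of one way to list which rows duplicate each other. It builds one string key per row with paste and sep = "\r", the same trick used in the duplicated.data.frame source quoted further down, so it inherits the NA caveat discussed in this thread. The names `key` and `groups` are illustrative only and do not come from the thread.

```r
x <- c(1, 3, 1)
y <- c(2, 4, 2)
z <- c(3, 4, 3)
data <- data.frame(x, y, z)

# One string key per row. Assumption: pasted strings are an acceptable
# notion of row equality here (NAs would become the string "NA").
key <- do.call(paste, c(data, sep = "\r"))

# Group row indices by key, then keep only keys that occur more than once.
groups <- split(seq_len(nrow(data)), key)
groups <- unname(groups[lengths(groups) > 1])
groups
```

For the example data this yields a single group holding rows 1 and 3.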
WK> i don't have any solution significantly better than what you have
WK> already been given. but i have a warning instead.
WK> in the below, you use both 'duplicated' and 'unique' on data frames, and
WK> the proposed solution relies on the latter. you may want to try to
WK> avoid both when working with data frames; this is because of how they
WK> do (or don't) work.
WK> duplicated (and unique, which calls duplicated) simply pastes the
WK> content of each row into a *string*, and then works on the strings.
WK> this means that NAs in the data frame are converted to "NA"s, and "NA"
WK> == "NA", obviously, so that rows that include NAs and are otherwise
WK> identical will be considered *identical*.
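A small sketch of the NA behaviour just described, using a made-up data frame `df` (not from the thread). The last two lines spell out the pasted-string comparison by hand, assuming the sep = "\r" convention of the source quoted further down.

```r
# Hypothetical data: rows 1 and 2 agree, including the NA in column b.
df <- data.frame(a = c(1, 1, 2), b = c(NA, NA, 3))

dup <- duplicated(df)   # row 2 is flagged as a duplicate of row 1
dup
unique(df)              # so row 2 is dropped here

# The same comparison done explicitly on pasted row strings.
row_strings <- do.call(paste, c(df, sep = "\r"))
duplicated(row_strings)
```

Both calls to duplicated flag only the second row, so a row containing NA is treated as equal to another row with NA in the same position.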
WK> that's not bad (yet), but you should be aware. however, duplicated has
WK> a parameter named 'incomparables', explained in ?duplicated as follows:
WK> "
WK> incomparables: a vector of values that cannot be compared. 'FALSE' is a
WK> special value, meaning that all values can be compared, and
WK> may be the only value accepted for methods other than the
WK> default. It will be coerced internally to the same type as
WK> 'x'.
WK> "
WK> and also
WK> "
WK> Values in 'incomparables' will never be marked as duplicated. This
WK> is intended to be used for a fairly small set of values and will
WK> not be efficient for a very large set.
WK> "
WK> that is, for example:
WK> vector = c(NA, NA)
WK> duplicated(vector)
WK> # [1] FALSE TRUE
WK> duplicated(vector, incomparables=NA)
WK> # [1] FALSE FALSE
WK> list = list(NA, NA)
WK> duplicated(list)
WK> # [1] FALSE TRUE
WK> duplicated(list, incomparables=NA)
WK> # [1] FALSE FALSE
WK> what the documentation *fails* to tell you is that the parameter
WK> 'incomparables' is defunct
No, not "defunct", but the contrary of it,
"not yet implemented" !
WK> in duplicated.data.frame, which you can see in its
WK> source code (below), or in the following example:
WK> # data as above, or any data frame
WK> duplicated(data, incomparables=NA)
WK> # Error in if (!is.logical(incomparables) || incomparables)
WK> #   .NotYetUsed("incomparables != FALSE") :
WK> #   missing value where TRUE/FALSE needed
WK> the error message here is *confusing*.
yes!
WK> the error is raised because the
WK> author of the code made a mistake and apparently haven't carefully
((plural or singular ??))
WK> examined and tested his product; the code goes:
((aah, ... "singular" ...))
WK> duplicated.data.frame
WK> # function (x, incomparables = FALSE, fromLast = FALSE, ...)
WK> # {
WK> # if (!is.logical(incomparables) || incomparables)
WK> # .NotYetUsed("incomparables != FALSE")
WK> # duplicated(do.call("paste", c(x, sep = "\r")), fromLast = fromLast)
WK> # }
WK> # <environment: namespace:base>
WK> clearly, the intention here is to raise an error with a (still hardly
WK> clear) message as in:
WK> .NotYetUsed("incomparables != FALSE")
WK> # Error: argument 'incomparables != FALSE' is not used (yet)
WK> but instead, if(NA) is evaluated (because '!is.logical(NA) || NA'
WK> evaluates, *obviously*, to NA) and hence the uninformative error message.
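The evaluation described above can be reproduced step by step. The guard at the end, `needs_not_yet_used`, is a hypothetical helper for illustration: it is one way such a check could avoid if(NA), and is not claimed to be the fix that went into base R.

```r
incomparables <- NA

# NA is logical, so the left operand is FALSE and `||` falls through to
# the right operand, which is NA.
cond <- !is.logical(incomparables) || incomparables
cond

# if (NA) is an error; capture its message instead of stopping.
msg <- tryCatch(if (cond) "reached" else "skipped",
                error = function(e) conditionMessage(e))
msg

# Hypothetical guard: anything other than a literal FALSE counts as
# "not yet used", so NA never reaches an if() condition as NA.
needs_not_yet_used <- function(incomparables) !identical(incomparables, FALSE)
needs_not_yet_used(NA)
needs_not_yet_used(FALSE)
```

With a guard of this shape, incomparables = NA would lead to the intended .NotYetUsed() message rather than the "missing value" error.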
WK> take home point: rtfm, *but* don't believe it.
and then be helpful to the R community and send a bug report
*with* a patch if {as in this case} you are able to...
Well, that's no longer needed here,
I'll fix that easily myself.
Martin
WK> vQ