[R] which rows are duplicates?

Wacek Kusnierczyk Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Mon Mar 30 14:26:24 CEST 2009


Michael Dewey wrote:
> At 05:07 30/03/2009, Aaron M. Swoboda wrote:
>> I would like to know which rows are duplicates of each other, not
>> simply that a row is duplicate of another row. In the following
>> example rows 1 and 3 are duplicates.
>>
>> > x <- c(1,3,1)
>> > y <- c(2,4,2)
>> > z <- c(3,4,3)
>> > data <- data.frame(x,y,z)
>>     x y z
>> 1 1 2 3
>> 2 3 4 4
>> 3 1 2 3
>

i don't have any solution significantly better than what you have
already been given.  but i have a warning instead.

in the below, you use both 'duplicated' and 'unique' on data frames, and
the proposed solution relies on the latter.  you may want to try to
avoid both when working with data frames;  this is because of how they
do (or don't) work.

for data frames, duplicated (and unique, which calls duplicated) simply
pastes the content of each row into a *string*, and then works on the
strings.  this means that NAs in the data frame are converted to "NA"s,
and "NA" == "NA", obviously, so that rows that include NAs and are
otherwise identical will be considered *identical*.
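
to make the pasting concrete, here is a minimal sketch that rebuilds the
per-row strings by hand, using the same paste call that appears in the
duplicated.data.frame source quoted further down.  the data frame 'df' and
the variable 'key' are made up for illustration, and other versions of R
may implement duplicated.data.frame differently:

```r
# toy data frame (made up for illustration): rows 1 and 3 are identical,
# rows 2 and 4 are identical and both contain an NA
df <- data.frame(x = c(1, 3, 1, 3), y = c(2, NA, 2, NA))

# build one string per row, the same way the quoted source does
key <- do.call("paste", c(df, sep = "\r"))
key
# the NAs have become the text "NA" inside the keys

# rows with NAs in the same places are flagged as duplicates
duplicated(df)
# [1] FALSE FALSE  TRUE  TRUE

# for every row, the index of the first row carrying the same key
match(key, key)
# [1] 1 2 1 2
```

as a side effect, match(key, key) also says *which* row each row repeats
(row 3 repeats row 1, row 4 repeats row 2), with the same caveat about NAs.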

that's not bad (yet), but you should be aware.  however, duplicated has
a parameter named 'incomparables', explained in ?duplicated as follows:

"
incomparables: a vector of values that cannot be compared. 'FALSE' is a
          special value, meaning that all values can be compared, and
          may be the only value accepted for methods other than the
          default.  It will be coerced internally to the same type as
          'x'.
"

and also

"
     Values in 'incomparables' will never be marked as duplicated. This
     is intended to be used for a fairly small set of values and will
     not be efficient for a very large set.
"

that is, for example:

    vector = c(NA, NA)
    duplicated(vector)
    # [1] FALSE TRUE
    duplicated(vector, incomparables=NA)
    # [1] FALSE FALSE

    list = list(NA, NA)
    duplicated(list)
    # [1] FALSE TRUE
    duplicated(list, incomparables=NA)
    # [1] FALSE FALSE


what the documentation *fails* to tell you is that the parameter
'incomparables' is defunct in duplicated.data.frame, which you can see
in its source code (below), or in the following example:

    # data as above, or any data frame
    duplicated(data, incomparables=NA)
    # Error in if (!is.logical(incomparables) || incomparables)
    #   .NotYetUsed("incomparables != FALSE") :
    #   missing value where TRUE/FALSE needed

the error message here is *confusing*.  the error is raised because the
author of the code made a mistake and apparently hasn't carefully
examined and tested his product;  the code goes:

    duplicated.data.frame
    # function (x, incomparables = FALSE, fromLast = FALSE, ...)
    # {
    #    if (!is.logical(incomparables) || incomparables)
    #        .NotYetUsed("incomparables != FALSE")
    #    duplicated(do.call("paste", c(x, sep = "\r")), fromLast = fromLast)
    # }
    # <environment: namespace:base>

clearly, the intention here is to raise an error with a (still hardly
clear) message as in:

    .NotYetUsed("incomparables != FALSE")
    # Error: argument 'incomparables != FALSE' is not used (yet)

but instead, if(NA) is evaluated (because '!is.logical(NA) || NA'
evaluates, *obviously*, to NA) and hence the uninformative error message.
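
the evaluation can be checked step by step.  the sketch below reproduces
the condition outside of duplicated.data.frame and then shows one possible
stricter guard; 'check_incomparables' is a hypothetical helper written for
this illustration, not something that exists in base R:

```r
# the condition from duplicated.data.frame, evaluated for incomparables = NA
incomparables <- NA
cond <- !is.logical(incomparables) || incomparables
cond
# [1] NA

# if (NA) is itself an error, so .NotYetUsed() is never reached
msg <- tryCatch(if (cond) "reached" else "skipped",
                error = function(e) conditionMessage(e))
msg

# hypothetical stricter guard: anything other than exactly FALSE is rejected
check_incomparables <- function(incomparables) {
    if (!identical(incomparables, FALSE))
        stop("argument 'incomparables != FALSE' is not used (yet)")
    invisible(TRUE)
}

guard_msg <- tryCatch(check_incomparables(NA),
                      error = function(e) conditionMessage(e))
guard_msg
```

identical(incomparables, FALSE) never returns NA, so with a guard of this
kind the intended message would be raised for NA as well.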

take home point:  rtfm, *but* don't believe it.
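
if rows containing NAs should never be flagged in a data frame, one
possible workaround is to mask the result with complete.cases().  this is
only a sketch: it assumes that "incomparable" should mean "the row has at
least one NA", by analogy with incomparables=NA on vectors, and the data
frame 'df' is made up for illustration:

```r
# made-up data: rows 1-2 are complete duplicates,
# rows 3-4 match each other but contain an NA
df <- data.frame(x = c(1, 1, NA, NA), y = c(2, 2, 3, 3))

duplicated(df)
# [1] FALSE  TRUE FALSE  TRUE

# assumption: any row with an NA counts as incomparable and is never flagged
dup <- duplicated(df) & complete.cases(df)
dup
# [1] FALSE  TRUE FALSE FALSE
```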

vQ


> Does this do what you want?
> > x <- c(1,3,1)
> > y <- c(2,4,2)
> > z <- c(3,4,3)
> > data <- data.frame(x,y,z)
> > data.u <- unique(data)
> > data.u
>   x y z
> 1 1 2 3
> 2 3 4 4
> > data.u <- cbind(data.u, set = 1:nrow(data.u))
> > merge(data, data.u)
>   x y z set
> 1 1 2 3   1
> 2 1 2 3   1
> 3 3 4 4   2
>
> You need to do a bit more work to get them back into the original row
> order if that is essential.
>
>
>
>> I can't figure out how to get R to tell me that observation 1 and 3
>> are the same.  It seems like the "duplicated" and "unique" functions
>> should be able to help me out, but I am stumped.
>>
>> For instance, if I use "duplicated" ...
>>
>> > duplicated(data)
>> [1] FALSE FALSE TRUE
>>
>> it tells me that row 3 is a duplicate, but not which row it matches.
>> How do I figure out WHICH row it matches?
>>
>> And If I use "unique"...
>>
>> > unique(data)
>>     x y z
>> 1 1 2 3
>> 2 3 4 4
>>
>> I see that rows 1 and 2 are unique, leaving me to infer that row 3 was
>> a duplicate, but again it doesn't tell me which row it was a duplicate
>> of (as far as I can tell). Am I missing something?
>>
>> How can I determine that row 3 is a duplicate OF ROW 1?
>>
>> Thanks,
>>
>> Aaron
>>
>>
>
> Michael Dewey
> http://www.aghmed.fsnet.co.uk
>


-- 
-------------------------------------------------------------------------------
Wacek Kusnierczyk, MD PhD

Email: waku at idi.ntnu.no
Phone: +47 73591875, +47 72574609

Department of Computer and Information Science (IDI)
Faculty of Information Technology, Mathematics and Electrical Engineering (IME)
Norwegian University of Science and Technology (NTNU)
Sem Saelands vei 7, 7491 Trondheim, Norway
Room itv303

Bioinformatics & Gene Regulation Group
Department of Cancer Research and Molecular Medicine (IKM)
Faculty of Medicine (DMF)
Norwegian University of Science and Technology (NTNU)
Laboratory Center, Erling Skjalgsons gt. 1, 7030 Trondheim, Norway
Room 231.05.060



