[Rd] duplicates() function

Petr Savicky savicky at cs.cas.cz
Tue Apr 12 15:26:49 CEST 2011


On Mon, Apr 11, 2011 at 02:05:11PM -0400, Duncan Murdoch wrote:
> On 08/04/2011 11:39 AM, Joshua Ulrich wrote:
> >On Fri, Apr 8, 2011 at 10:15 AM, Duncan Murdoch
> ><murdoch.duncan at gmail.com>  wrote:
> >>  On 08/04/2011 11:08 AM, Joshua Ulrich wrote:
> >>>
> >>>  How about:
> >>>
> >>>  y<- rep(NA,length(x))
> >>>  y[duplicated(x)]<- match(x[duplicated(x)] ,x)
> >>
> >>  That's a nice solution for vectors.  Unfortunately for me, I have a 
> >matrix
> >>  (which duplicated() handles by checking whole rows).  So a better 
> >example
> >>  that I should have posted would be
> >>
> >>  x<-  cbind(1, c(9,7,9,3,7) )
> >>
> >>  and I'd still like the same output
> >>
> >For a matrix, could you apply the same strategy used in duplicated()?
> >
> >y<- rep(NA,NROW(x))
> >temp<- apply(x, 1, function(x) paste(x, collapse="\r"))
> >y[duplicated(temp)]<- match(temp[duplicated(temp)], temp)
> 
> Since this thread hasn't ended, I will say that I think this solution is 
> the best I've seen for my specific problem.  I was actually surprised 
> that duplicated() did the string concatenation trick, but since it does, 
> it makes a lot of sense to do the same in duplicates().

Consistency with duplicated() is a good argument.

Let me point out, although it goes beyond the original question, that
sorting may be used to compute duplicated() in a way, which is more
efficient than the paste() approach according to the test below.

  duplicatedSort <- function(df)
  {
      n <- nrow(df)
      if (n == 1) {
          return(FALSE)
      } else {
          s <- do.call(order, as.data.frame(df))
          equal <- df[s[2:n], , drop=FALSE] == df[s[1:(n-1)], , drop=FALSE]
          dup <- c(FALSE, rowSums(equal) == ncol(df))
          return(dup[order(s)])
      }
  }

The following tests efficiency for a character matrix.
 
  m <- 1000
  n <- 4
  a <- matrix(as.character(sample(10, m*n, replace=TRUE)), nrow=m, ncol=n)
  system.time(out1 <- duplicatedSort(a))
  system.time(out2 <- duplicated(a))
  identical(out1, out2)
  table(out1)

I obtained, for example,

     user  system elapsed 
    0.003   0.000   0.003 
  
     user  system elapsed 
    0.012   0.000   0.011 
  
  [1] TRUE

  out1
  FALSE  TRUE 
    942    58 

For a numeric matrix, the ratio of the running times is larger in
the same direction.

Petr Savicky.



More information about the R-devel mailing list