[R] Deduping in R by multiple variables

Thu Aug 30 01:00:12 CEST 2012

You can find out which rows of a data.frame called dataFrame
are duplicates of previous rows with
   dups <- duplicated(dataFrame)
To make a new data.frame without them do
   duplessDataFrame <- dataFrame[!dups, ]
You could use unique(dataFrame), but, as in your examples, I
think one often wants to remove duplicates based on only
some of the columns.  E.g., with the following data.frame
   dataFrame <- data.frame(Name=LETTERS[1:9],
                           One=rep(1:3,3),
                           Two=c(11,12,13,11,11,12,12,13,13),
                           Three=c(101,102,103,101,101,103,101,102,103))
we get
  > dataFrame
    Name One Two Three
  1    A   1  11   101
  2    B   2  12   102
  3    C   3  13   103
  4    D   1  11   101
  5    E   2  11   101
  6    F   3  12   103
  7    G   1  12   101
  8    H   2  13   102
  9    I   3  13   103
  > duplicated(dataFrame)
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  > dups123 <- duplicated(dataFrame[,c("One","Two","Three")])
  > dups123
  [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
  > dataFrame[!dups123, ]
    Name One Two Three
  1    A   1  11   101
  2    B   2  12   102
  3    C   3  13   103
  5    E   2  11   101
  6    F   3  12   103
  7    G   1  12   101
  8    H   2  13   102

Your first expression
   detail3 <- [!duplicated(...)]
must have caused a syntax error, as "[" is the subscript operator
and requires something before it, as in detail2[...].
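
A minimal sketch of what the first attempt was presumably aiming for,
run on the small example dataFrame from above rather than on detail2
(which is not available here). The name keyCols is made up for the
illustration. Besides the missing object in front of "[", note that
c() on several columns strings them end to end into one long vector,
so duplicated() would return one value per element, not one per row.

```r
dataFrame <- data.frame(Name = LETTERS[1:9],
                        One = rep(1:3, 3),
                        Two = c(11, 12, 13, 11, 11, 12, 12, 13, 13),
                        Three = c(101, 102, 103, 101, 101, 103, 101, 102, 103))

# c() concatenates the three columns into one vector of 3 * 9 = 27 elements,
# so duplicated() on it gives 27 values instead of one per row.
flat <- c(dataFrame$One, dataFrame$Two, dataFrame$Three)
length(duplicated(flat))

# Pass a data.frame of the key columns to duplicated() instead, and put the
# object being subscripted in front of "[".
keyCols <- c("One", "Two", "Three")   # hypothetical name for the key columns
deduped <- dataFrame[!duplicated(dataFrame[, keyCols]), ]
nrow(deduped)
```

For detail2 the same pattern would apply with keyCols set to the names
of the key columns ("TDATE", "FIRM", and so on).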

To see why your second attempt
   detail3 <-
   unique(detail2[,c(detail2$TDATE,detail2$FIRM,detail2$CM,detail2$BRANCH,
           detail2$BEGTIME, detail2$ENDTIME,detail2$OTYPE,detail2$OCOND,
           detail2$ACCTYP ,detail2$OSIDE,detail2$SHARES,detail2$STOCKS,
           detail2$STKFUL)])
will not do what you want (even if it did finish in a reasonable amount of time),
break it into pieces and use the example dataset above.  You asked it to extract
the columns specified by 'tmp', where 'tmp' was constructed by:
  > print(tmp <- c(dataFrame$One, dataFrame$Two, dataFrame$Three))
   [1]   1   2   3   1   2   3   1   2   3  11  12
  [12]  13  11  11  12  12  13  13 101 102 103 101
  [23] 101 103 101 102 103
Then dataFrame[, tmp] asks for a 27-column data.frame built from columns
1, 2, 3, 11, 12, ..., most of which don't exist in the original 4-column
data.frame.  You should have gotten an 'undefined columns selected' error.
Perhaps it ran out of memory while checking all 184K * 13 column indices,
but that would be odd.
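
This can be checked on the small example. The sketch below catches the
error from dataFrame[, tmp] and then shows the column-name form that
was presumably intended. The names msg and uniqueKeys are made up for
the illustration.

```r
dataFrame <- data.frame(Name = LETTERS[1:9],
                        One = rep(1:3, 3),
                        Two = c(11, 12, 13, 11, 11, 12, 12, 13, 13),
                        Three = c(101, 102, 103, 101, 101, 103, 101, 102, 103))

tmp <- c(dataFrame$One, dataFrame$Two, dataFrame$Three)

# Using the column contents as column indices fails: there is no column 11,
# 12, ... in a 4-column data.frame. tryCatch() keeps the error message.
msg <- tryCatch({
  dataFrame[, tmp]
  "no error"
}, error = function(e) conditionMessage(e))
msg

# Columns are selected by name (or position), not by their contents.
uniqueKeys <- unique(dataFrame[, c("One", "Two", "Three")])
dim(uniqueKeys)
```

Note that unique() on the subset returns only the key columns. When the
other columns (here Name) should be kept, subscript the full data.frame
with !duplicated(...) as in the working example.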

Now if you used the calls I mentioned at first (in the working example)
and R hung, there might be ways to speed up the process.
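
One thing that might be worth trying, purely as a sketch: collapse the
key columns into a single character key per row and call duplicated()
on that vector. Whether this is actually faster than duplicated() on
the data.frame depends on the R version and the data, so time both on
the real dataset. The "\r" separator assumes no key value contains a
carriage return, and paste() formats numbers to at most 15 significant
digits, so very close non-integer values could end up with the same
key. The names rowKey and dupsByKey are made up for the illustration.

```r
dataFrame <- data.frame(Name = LETTERS[1:9],
                        One = rep(1:3, 3),
                        Two = c(11, 12, 13, 11, 11, 12, 12, 13, 13),
                        Three = c(101, 102, 103, 101, 101, 103, 101, 102, 103))

keyCols <- c("One", "Two", "Three")

# Build one string per row from the key columns, e.g. "1\r11\r101".
rowKey <- do.call(paste, c(unname(as.list(dataFrame[keyCols])), sep = "\r"))

# duplicated() on a plain character vector marks repeats of earlier keys.
dupsByKey <- duplicated(rowKey)
deduped <- dataFrame[!dupsByKey, ]

# On this example the result matches duplicated() on the key-column data.frame.
identical(dupsByKey, duplicated(dataFrame[, keyCols]))
```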

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of ramoss
> Sent: Wednesday, August 29, 2012 1:58 PM
> To: r-help at r-project.org
> Subject: [R] Deduping in R by multiple variables
> 
> I have a dataset w/ 184K obs & 16 variables.  In SAS I proc sort nodupkey it
> in seconds by 11 variables.
> I tried to do the same thing in R using both the unique & then the
> !duplicated functions but it just hangs there & I get no output.  Does
> anyone know how to solve this?
> 
> This is how I tried to do it in R:
> 
> 
> detail3 <-
> [!duplicated(c(detail2$TDATE,detail2$FIRM,detail2$CM,detail2$BRANCH,
>                detail2$BEGTIME,detail2$ENDTIME,detail2$OTYPE,detail2$OCOND,
>                detail2$ACCTYP,detail2$OSIDE,detail2$SHARES,detail2$STOCKS,
>                detail2$STKFUL)),]
> 
> detail3 <-
> unique(detail2[,c(detail2$TDATE,detail2$FIRM,detail2$CM,detail2$BRANCH,
>           detail2$BEGTIME, detail2$ENDTIME,detail2$OTYPE,detail2$OCOND,
>           detail2$ACCTYP ,detail2$OSIDE,detail2$SHARES,detail2$STOCKS,
>           detail2$STKFUL)])
> 
> 
> 
> 
> Thanks in advance
> 
> 
> 
> --
> View this message in context:
> http://r.789695.n4.nabble.com/Deduping-in-R-by-multiple-variables-tp4641778.html
> Sent from the R help mailing list archive at Nabble.com.
> 