[R] Deduping in R by multiple variables
William Dunlap
wdunlap at tibco.com
Thu Aug 30 01:00:12 CEST 2012
You can find out which rows of a data.frame called dataFrame
are duplicates of previous rows with
dups <- duplicated(dataFrame)
To make a new data.frame without them do
duplessDataFrame <- dataFrame[!dups, ]
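As a quick check of the two calls above, here is a minimal sketch on a made-up three-row data.frame. The name toy and its contents are illustrative only and do not come from the original question. It also compares the subscripted result with what unique() returns.

```r
# Hypothetical toy data: row 3 repeats row 1 exactly.
toy <- data.frame(x = c(1, 2, 1), y = c("a", "b", "a"))

dups <- duplicated(toy)       # TRUE marks a repeat of an earlier row
duplessToy <- toy[!dups, ]    # keep only the first occurrence of each row

print(dups)
print(duplessToy)

# Compare with unique(), which should keep the same rows.
print(all.equal(duplessToy, unique(toy)))
```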
You could use unique(dataFrame), but, as in your examples, I
think one often wants to remove duplicates based on only
some of the columns. E.g., with the following data.frame
dataFrame <- data.frame(Name=LETTERS[1:9],
                        One=rep(1:3,3),
                        Two=c(11,12,13,11,11,12,12,13,13),
                        Three=c(101,102,103,101,101,103,101,102,103))
we get
> dataFrame
  Name One Two Three
1    A   1  11   101
2    B   2  12   102
3    C   3  13   103
4    D   1  11   101
5    E   2  11   101
6    F   3  12   103
7    G   1  12   101
8    H   2  13   102
9    I   3  13   103
> duplicated(dataFrame)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> dups123 <- duplicated(dataFrame[,c("One","Two","Three")])
> dups123
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
> dataFrame[!dups123, ]
  Name One Two Three
1    A   1  11   101
2    B   2  12   102
3    C   3  13   103
5    E   2  11   101
6    F   3  12   103
7    G   1  12   101
8    H   2  13   102
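The subset-of-columns idiom can be wrapped in a small function. This is only a sketch: dedupBy and its argument names are made up here, not part of base R. It keeps the first row for each distinct combination of the named key columns.

```r
# Hypothetical helper (not a base R function): keep the first row for each
# distinct combination of the columns named in 'keys'.
dedupBy <- function(df, keys) {
  stopifnot(is.data.frame(df), all(keys %in% names(df)))
  # drop = FALSE keeps data.frame structure even with a single key column
  df[!duplicated(df[, keys, drop = FALSE]), , drop = FALSE]
}

dataFrame <- data.frame(Name = LETTERS[1:9],
                        One = rep(1:3, 3),
                        Two = c(11, 12, 13, 11, 11, 12, 12, 13, 13),
                        Three = c(101, 102, 103, 101, 101, 103, 101, 102, 103))

res <- dedupBy(dataFrame, c("One", "Two", "Three"))
print(res)   # rows 4 (D) and 9 (I) are dropped
```

For the data in the original question the call would presumably be dedupBy(detail2, c("TDATE", "FIRM", "CM")) with the remaining key columns added to the character vector, assuming those really are the column names of detail2.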
Your first expression
detail3 <- [!duplicated(...)]
must have caused a syntax error, as "[" is the subscript operator
and requires something before it, as in detail2[...].
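To illustrate the point, the sketch below feeds a shortened version of that expression to parse() and then shows a form that does parse and run. The three-row detail2 here is invented for the example and only borrows a few column names from the question.

```r
# The bare "[" form does not even parse; parse() signals a syntax error,
# which tryCatch() turns into a plain string here.
bad <- "detail3 <- [!duplicated(x), ]"
parsed <- tryCatch(parse(text = bad), error = function(e) "syntax error")
print(parsed)

# Made-up stand-in for the real detail2 (hypothetical values).
detail2 <- data.frame(FIRM = c("a", "a", "b"),
                      CM = c(1, 1, 2),
                      SHARES = c(10, 20, 30))

# With a data.frame in front of "[" the same idea works:
# rows are kept when their (FIRM, CM) pair has not been seen before.
detail3 <- detail2[!duplicated(detail2[, c("FIRM", "CM")]), ]
print(detail3)
```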
To see why your second attempt
detail3 <-
unique(detail2[,c(detail2$TDATE,detail2$FIRM,detail2$CM,detail2$BRANCH,
detail2$BEGTIME, detail2$ENDTIME,detail2$OTYPE,detail2$OCOND,
detail2$ACCTYP ,detail2$OSIDE,detail2$SHARES,detail2$STOCKS,
detail2$STKFUL)])
will not do what you want (even if it did finish in a reasonable amount of time),
break it into pieces and use the example dataset above. You asked it to extract
the columns specified by 'tmp', where 'tmp' was constructed by:
> print(tmp <- c(dataFrame$One, dataFrame$Two, dataFrame$Three))
[1] 1 2 3 1 2 3 1 2 3 11 12
[12] 13 11 11 12 12 13 13 101 102 103 101
[23] 101 103 101 102 103
Then dataFrame[, tmp] asks for a 27-column data.frame built from the
columns at those positions. Only positions 1, 2 and 3 exist in the original
4-column data.frame; 11, 12, 13, 101, 102 and 103 do not, so you should
have gotten an 'undefined columns selected' error. Perhaps it ran out of
memory while checking all 184K * 13 column indices, but that would be odd.
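The difference between subscripting by column values and by column names can be seen on the example dataset. This is a small sketch; msg and keyCols are names introduced just for the illustration.

```r
dataFrame <- data.frame(Name = LETTERS[1:9],
                        One = rep(1:3, 3),
                        Two = c(11, 12, 13, 11, 11, 12, 12, 13, 13),
                        Three = c(101, 102, 103, 101, 101, 103, 101, 102, 103))

# c() of the columns themselves gives their values, 27 numbers in all.
tmp <- c(dataFrame$One, dataFrame$Two, dataFrame$Three)
print(length(tmp))

# Using those values as column positions fails, since there is no column 11.
msg <- tryCatch({
  dataFrame[, tmp]
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# A character vector of column names selects the intended three columns.
keyCols <- c("One", "Two", "Three")
print(dim(dataFrame[, keyCols]))
```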
Now if you used the calls I mentioned at first (in the working example)
and R hung, there might be ways to speed up the process.
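One possibility, offered only as an untested-for-speed sketch, is to collapse the key columns into a single string per row and call duplicated() on that character vector. The function name dedupByPastedKey and the choice of "\r" as separator are assumptions made for this example. Whether it is actually faster would have to be timed (e.g., with system.time()) on the real data and R version.

```r
# Hypothetical alternative: one pasted key string per row.
# Caveats: values that contain 'sep' can produce false matches, and numbers
# are compared via their character representation, not bit for bit.
dedupByPastedKey <- function(df, keys, sep = "\r") {
  stopifnot(is.data.frame(df), all(keys %in% names(df)))
  # unname() so column names are not mistaken for arguments of paste()
  key <- do.call(paste, c(unname(as.list(df[keys])), sep = sep))
  df[!duplicated(key), , drop = FALSE]
}

dataFrame <- data.frame(Name = LETTERS[1:9],
                        One = rep(1:3, 3),
                        Two = c(11, 12, 13, 11, 11, 12, 12, 13, 13),
                        Three = c(101, 102, 103, 101, 101, 103, 101, 102, 103))

keyCols <- c("One", "Two", "Three")
viaPaste <- dedupByPastedKey(dataFrame, keyCols)
viaDataFrame <- dataFrame[!duplicated(dataFrame[, keyCols]), ]

# Both approaches should keep the same rows on this small example.
print(rownames(viaPaste))
print(rownames(viaDataFrame))
```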
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of ramoss
> Sent: Wednesday, August 29, 2012 1:58 PM
> To: r-help at r-project.org
> Subject: [R] Deduping in R by multiple variables
>
> I have a dataset w/ 184K obs & 16 variables. In SAS I proc sort nodupkey it
> in seconds by 11 variables.
> I tried to do the same thing in R using both the unique & then the
> !duplicated functions but it just hangs there & I get no output. Does
> anyone know how to solve this?
>
> This is how I tried to do it in R:
>
>
> detail3 <-
> [!duplicated(c(detail2$TDATE,detail2$FIRM,detail2$CM,detail2$BRANCH,
> detail2$BEGTIME,
> detail2$ENDTIME,detail2$OTYPE,detail2$OCOND,
> detail2$ACCTYP
> ,detail2$OSIDE,detail2$SHARES,detail2$STOCKS,
> detail2$STKFUL)),]
>
> detail3 <-
> unique(detail2[,c(detail2$TDATE,detail2$FIRM,detail2$CM,detail2$BRANCH,
> detail2$BEGTIME, detail2$ENDTIME,detail2$OTYPE,detail2$OCOND,
> detail2$ACCTYP ,detail2$OSIDE,detail2$SHARES,detail2$STOCKS,
> detail2$STKFUL)])
>
>
>
>
> Thanks in advance