[R] help with duplicates
Peter Dalgaard
p.dalgaard at biostat.ku.dk
Fri Jun 5 19:59:48 CEST 2009
Chris Anderson wrote:
> I have a large dataset that contains duplicate records. How do I identify and remove duplicate records?
>
Here's one way:
> aq <- airquality[sample(NROW(airquality), replace=TRUE),]
> any(duplicated(aq))
[1] TRUE
> which(duplicated(aq))
[1] 2 15 34 44 45 47 49 50 52 53 65 75 76 78 83 86 88 90 91
[20] 94 96 98 99 100 103 104 107 108 110 111 112 114 117 119 120 121 122 124
[39] 125 126 127 129 130 132 133 135 137 140 141 143 145 146 147 151 152
> aqs <- subset(aq,!duplicated(aq))
> any(duplicated(aqs))
[1] FALSE
> dim(aqs)
[1] 98 6
> dim(aq)
[1] 153 6
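A shorter route to the same result may be unique(), whose data frame method drops the rows that duplicated() flags. The sketch below is not part of the session above: the seed is an arbitrary assumption added only so the resample is reproducible, and the names aqu and aqs2 are made up for illustration.

```r
# Resample airquality with replacement so that duplicate rows appear
set.seed(42)  # arbitrary seed, assumed here only for reproducibility
aq <- airquality[sample(NROW(airquality), replace = TRUE), ]

# unique() keeps the first occurrence of each distinct row
aqu <- unique(aq)

# The same rows selected by plain logical indexing
aqs2 <- aq[!duplicated(aq), ]

# Compare the two results, then check that no duplicates remain
all.equal(aqu, aqs2)
any(duplicated(aqu))
```

Both forms keep the first copy of each record; passing fromLast=TRUE to duplicated() or unique() keeps the last copy instead.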
For data frames with many columns you might want to think more carefully
about how you recognize duplicates and maybe use a subset of columns.
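As a rough sketch of the subset-of-columns idea: call duplicated() on just the columns that identify a record, then use the result to index the full data frame. Treating Month and Day as the identifying columns of airquality is an assumption made for this illustration, as are the seed and the names key, dup and aq_key.

```r
# Resample airquality with replacement so that duplicate rows appear
set.seed(42)  # arbitrary seed, assumed here only for reproducibility
aq <- airquality[sample(NROW(airquality), replace = TRUE), ]

# Columns assumed to identify a record (illustrative choice)
key <- c("Month", "Day")

# Flag rows whose key columns repeat an earlier row
dup <- duplicated(aq[, key])

# Keep the first row for each Month/Day combination, with all columns
aq_key <- aq[!dup, ]

sum(dup)      # number of rows dropped
dim(aq_key)   # rows and columns that remain
```

Rows that agree on the key columns but differ elsewhere are dropped too, so it may be worth inspecting aq[dup, ] before discarding anything.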
--
O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B
c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907