[R] Choose between duplicated rows

francy francy.casalino at gmail.com
Sat Apr 14 21:03:36 CEST 2012


Dear R experts,

Sorry for this basic question, but I can't seem to find a solution…

I have this data frame:
df <- data.frame(id = c("id1", "id1", "id1", "id2", "id2", "id2"),
                 A  = c(11905, 11907, 11907, 11829, 11829, 11829),
                 v1 = c(NA, 3, NA, 1, 2, NA),
                 v2 = c(NA, 2, NA, 2, NA, NA),
                 v3 = c(NA, 1, NA, 1, NA, NA),
                 v4 = c("N", "Y", "N", "Y", "N", "N"),
                 v5 = c(0, 0, 0, 1, 0, 0),
                 numMiss = c(3, 0, 3, 0, 2, 3))

> df
   id     A v1 v2 v3 v4 v5 numMiss
1 id1 11905 NA NA NA  N  0       3
2 id1 11907  3  2  1  Y  0       0
3 id1 11907 NA NA NA  N  0       3
4 id2 11829  1  2  1  Y  1       0
5 id2 11829  2 NA NA  N  0       2
6 id2 11829 NA NA NA  N  0       3


Among the rows that share the same value of "A" within an id, I need to keep
only the ones with the fewest missing values across the variables (that is,
those with min(numMiss)), to get this:

   id     A v1 v2 v3 v4 v5 numMiss
1 id1 11905 NA NA NA  N  0       3
2 id1 11907  3  2  1  Y  0       0
4 id2 11829  1  2  1  Y  1       0
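
This first step could perhaps be sketched in base R as follows. It is only a
rough illustration, not a definitive solution: the names key, min_miss and
step1 are made up for the example, and it assumes that rows tied on the
minimum numMiss should all be kept.

```r
# Example data, as given in the post.
df <- data.frame(id = c("id1", "id1", "id1", "id2", "id2", "id2"),
                 A  = c(11905, 11907, 11907, 11829, 11829, 11829),
                 v1 = c(NA, 3, NA, 1, 2, NA),
                 v2 = c(NA, 2, NA, 2, NA, NA),
                 v3 = c(NA, 1, NA, 1, NA, NA),
                 v4 = c("N", "Y", "N", "Y", "N", "N"),
                 v5 = c(0, 0, 0, 1, 0, 0),
                 numMiss = c(3, 0, 3, 0, 2, 3))

# One grouping key per (id, A) combination (hypothetical helper name).
key <- paste(df$id, df$A)

# Smallest numMiss within each (id, A) group, repeated for every row.
min_miss <- ave(df$numMiss, key, FUN = min)

# Keep the rows that reach their group's minimum (ties are all kept).
step1 <- df[df$numMiss == min_miss, ]
step1
```

Building a single key with paste() avoids grouping on id and A separately,
which would create empty id/A combinations.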

Then, among the rows that have the same id, I have to keep the record with
the smallest value of "A", like this:
   id     A v1 v2 v3 v4 v5 numMiss
1 id1 11905 NA NA NA  N  0       3
4 id2 11829  1  2  1  Y  1       0
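
The second step might look like the sketch below, again in base R and again
only as an illustration. It starts from the three rows that survive the
first filter, typed in by hand here so the example runs on its own; step1,
min_A and result are made-up names, and rows tied on the smallest "A" within
an id would all be kept.

```r
# The three rows left after the first filter, with their original row names.
step1 <- data.frame(id = c("id1", "id1", "id2"),
                    A  = c(11905, 11907, 11829),
                    v1 = c(NA, 3, 1),
                    v2 = c(NA, 2, 2),
                    v3 = c(NA, 1, 1),
                    v4 = c("N", "Y", "Y"),
                    v5 = c(0, 0, 1),
                    numMiss = c(3, 0, 0),
                    row.names = c("1", "2", "4"))

# Smallest A within each id, repeated for every row of that id.
min_A <- ave(step1$A, step1$id, FUN = min)

# Keep the rows whose A equals the minimum for their id.
result <- step1[step1$A == min_A, ]
result
```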

For grouping I have used the package "plyr" before, but this would involve
a sort of double grouping, by id and by duplicated values of A. Could you
please help me understand how this can be done?

Thank you very much.
-f
