[R] Select rows based on multiple comparisons

Mon Jan 16 23:33:46 CET 2012

On Jan 16, 2012, at 4:13 PM, Felipe Nunes wrote:

> Dear all,
> 
> I have a data set in which the same unit appears 2 or 3 or 4 times. I need
> to aggregate this data to maintain only one unit by row. But I need to do
> that based on a comparison between the values of such units. I can't find a
> function to help me on that. I appreciate any help. Below I provide an
> example of what I want:
> 
> This is my data:
> 
> Units  Var1 Var2 Var3
>  1          B       2        2
>  1          C       1       3
>  2          D      3        3
>  2          C      2        1
>  2          A      1        2
>  3          C      2        1
>  3          A      1        3
>  4          B      3        3
>  4          C      2        2
>  4          A      1        1
> 
> I want to produce a new data in which I keep, for each unit, the row with
> max number for Var2. For instance,
> 
> Units  Var1 Var2 Var3
>  1          B       2       2
>  2          D      3        3
>  3          C      2        1
>  4          B      3        3
> 
> Thanks,

There are likely to be multiple approaches, but the first that comes to mind is combining ?merge with ?aggregate. 

> DF
   Units Var1 Var2 Var3
1      1    B    2    2
2      1    C    1    3
3      2    D    3    3
4      2    C    2    1
5      2    A    1    2
6      3    C    2    1
7      3    A    1    3
8      4    B    3    3
9      4    C    2    2
10     4    A    1    1

> merge(aggregate(Var2 ~ Units, max, data = DF), DF)
  Units Var2 Var1 Var3
1     1    2    B    2
2     2    3    D    3
3     3    2    C    1
4     4    3    B    3

The use of aggregate() gets you the max value of Var2 for each group within Units:

> aggregate(Var2 ~ Units, max, data = DF)
  Units Var2
1     1    2
2     2    3
3     3    2
4     4    3

merge() then allows you to add back the columns from 'DF', matching on 'Units' and 'Var2', which are the common columns between the two data frames. Note that the ordering of the columns is changed a bit here. That is easily adjusted with:

> merge(aggregate(Var2 ~ Units, max, data = DF), DF)[, c(1, 3, 2, 4)]
  Units Var1 Var2 Var3
1     1    B    2    2
2     2    D    3    3
3     3    C    2    1
4     4    B    3    3

Note that you would want to use the 'na.rm = TRUE' argument to max() in the call to aggregate() if in fact any of your Var2 arguments are NA.

HTH,

Marc Schwartz