[R] Create factor variable by groups

Dennis Murphy djmuser at gmail.com
Tue Jul 5 10:05:30 CEST 2011


Hi:

There are several ways to do this; I'll offer one from the plyr
package. See inline.

On Mon, Jul 4, 2011 at 9:57 PM, Mateus Rabello
<mateus_rabello at hotmail.com> wrote:
> Hi, suppose that I have the following data.frame:
>
>      cnae4 cnpj 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 Y
>      24996 10020470 1 1 2 12 16 21 17 51 43 19 183
>      24996 10020470 69 91 79 92 91 77 90 96 98 108 891
>      36145 10020470 0 0 0 0 2 83 112 97 91 144 529
>      44444 10023333 5 20 60 0 0 0 0 5 20 1000 1110
>
>
> I would like to create a new variable X that indicates which line, within the cnpj variable, has the highest value Y. For instance, within the cnpj = 10020470, the second line has the largest value Y (891). For cnpj = 10023333 is trivial (1110). Then, my new data.frame would become:
>
>      cnae4 cnpj 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 Y X
>      24996 10020470 1 1 2 12 16 21 17 51 43 19 183 FALSE
>      24996 10020470 69 91 79 92 91 77 90 96 98 108 891 TRUE
>      36145 10020470 0 0 0 0 2 83 112 97 91 144 529 FALSE
>      44444 10023333 5 20 60 0 0 0 0 5 20 1000 1110 TRUE
>
>
> Notice that for every value of the variable cnpj, only one line will have X = TRUE.
>
> Then, I would like to create a variable Z that is the sum of variable Y, also by variable cnpj. Thus, if cnpj = 10020470, Z = 183 + 891 +529 and for cnpj = 10023333, Z = 120. These sums can easily be done with tapply or aggregate but those would eliminate line with equal cnpj and I don’t want that. I would like to achieve a data.frame like the following:
>
>      cnae4 cnpj 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 Y X Z
>      24996 10020470 1 1 2 12 16 21 17 51 43 19 183 FALSE 1603
>      24996 10020470 69 91 79 92 91 77 90 96 98 108 891 TRUE 1603
>      36145 10020470 0 0 0 0 2 83 112 97 91 144 529 FALSE 1603
>      44444 10023333 5 20 60 0 0 0 0 5 20 1000 1110 TRUE 1110

To get the above structure, then assuming this data frame is named df,
one way is to use the ddply() function in the plyr package with an
external function to do the work for a generic subset with constant
cnpj:

library(plyr)
myfun <- function(d) {
     d$X <- d$Y == max(d$Y)
     d$Z <- sum(d$Y)
     d
  }
ddply(df, .(cnpj), myfun)

>
>
> In the end I will eliminate all lines with X = FALSE.

To do this, all you need is to rewrite the function slightly:

myfun2 <- function(d) {
    d$Z <- sum(d$Y)
    d[which.max(d$Y), ]
  }
ddply(df, .(cnpj), myfun)

HTH,
Dennis


>
>
> Thank you and sorry for the long question.
>
> Mateus Rabello
>        [[alternative HTML version deleted]]
>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>



More information about the R-help mailing list