[R] (Newbie) Aggregate for NA values
Adaikalavan Ramasamy
ramasamy at cancer.org.uk
Fri Feb 24 17:05:07 CET 2006
I think it makes perfect sense for R to drop these rows, since 'NA' carries
no information about group membership. I do not know if there is an elegant
solution, but I would suggest that you recode the 'NA' values as an
informative category. Here is one possibility:
df <- data.frame( AA=1:10, BB=rep(1:5,2), CC=rep(1:2,5), DD=rnorm(10) )
df[ 9:10, "CC" ] <- NA
df[is.na(df)] <- "lala" ## change NA's into informative category ##
aggregate( df$DD, by=list( df$CC ), mean )
  Group.1          x
1       1  1.1533763
2       2  0.6427338
3    lala -0.2745249
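One caveat with the recoding step above: df[is.na(df)] <- "lala" replaces NA in every column, so a numeric column that happens to contain NA would be turned into character. A minimal sketch that recodes only the grouping variable follows. The label "missing" and the fixed DD values are assumptions made for illustration (fixed so the result can be checked), not part of the original example.

```r
# Toy data; DD is fixed (not rnorm) so the result is reproducible.
df <- data.frame(AA = 1:10, BB = rep(1:5, 2), CC = rep(1:2, 5),
                 DD = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4))
df[9:10, "CC"] <- NA

# Recode NA in the grouping column only; "missing" is an arbitrary label.
cc_group <- ifelse(is.na(df$CC), "missing", as.character(df$CC))

# One row per group, including the former NA rows.
res <- aggregate(df$DD, by = list(CC = cc_group), FUN = mean)
print(res)
```

Because the recoded vector is kept separate from df, the original columns keep their types and the NA values stay available for other analyses.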
aggregate( df$DD, by=list( df$BB, df$CC ), mean )
   Group.1 Group.2           x
1        1       1  0.47264081
2        2       1  0.63795211
3        3       1  1.66756015
4        5       1  1.83535232
5        1       2  0.89914287
6        2       2  1.11102134
7        3       2  0.22268699
8        4       2  0.33808394
9        4    lala -0.60154608
10       5    lala  0.05249622
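Another option, offered only as a sketch, is to leave the data untouched and make NA an explicit factor level with addNA() from base R, then summarise with tapply(). The DD values below are fixed for illustration rather than drawn with rnorm.

```r
# Toy data with fixed DD values so the results are reproducible.
df <- data.frame(AA = 1:10, BB = rep(1:5, 2), CC = rep(1:2, 5),
                 DD = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4))
df[9:10, "CC"] <- NA

# addNA() turns NA into a real factor level, so it forms its own group.
cc_f <- addNA(df$CC)

# Mean of DD per level of CC, including the NA level.
by_cc <- tapply(df$DD, cc_f, mean)
print(by_cc)

# Two-way table of means: rows are BB, columns are CC (last column is NA).
# Combinations that never occur are reported as NA.
by_bb_cc <- tapply(df$DD, list(BB = df$BB, CC = cc_f), mean)
print(by_bb_cc)
```

The result is an array rather than a data frame, which can be convenient for a quick look at how the NA group compares with the others.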
Regards, Adai
On Fri, 2006-02-24 at 10:16 -0500, Vivek Satsangi wrote:
> Folks,
>
> Sorry if this question has been answered before or is obvious (or
> worse, statistically "bad"). I don't understand what was said in one
> of the search results that seems somewhat related.
>
> I use aggregate to get a quick summary of the data. Part of what I am
> looking for in the summary is, how much influence might the NA's have
> had, if they were included, and is excluding them from the means
> causing some sort of bias. So I want the summary stat for the NA's
> also.
>
> Here is a simple example session (edited to remove the typos I made,
> comments added later):
>
> > tmp_a <- 1:10
> > tmp_b <- rep(1:5,2)
> > tmp_c <- rep(1:2,5)
> > tmp_d <- c(1,1,1,2,2,2,3,3,3,4)
> > tmp_df <- data.frame(tmp_a,tmp_b,tmp_c,tmp_d);
> > tmp_df$tmp_c[9:10] <- NA ;
> > tmp_df
>    tmp_a tmp_b tmp_c tmp_d
> 1      1     1     1     1
> 2      2     2     2     1
> 3      3     3     1     1
> 4      4     4     2     2
> 5      5     5     1     2
> 6      6     1     2     2
> 7      7     2     1     3
> 8      8     3     2     3
> 9      9     4    NA     3
> 10    10     5    NA     4
> > aggregate(tmp_df$tmp_d,by=list(tmp_df$tmp_b,tmp_df$tmp_c),mean);
>   Group.1 Group.2 x
> 1       1       1 1
> 2       2       1 3
> 3       3       1 1
> 4       5       1 2
> 5       1       2 2
> 6       2       2 1
> 7       3       2 3
> 8       4       2 2
> # Only one row for each (tmp_b, tmp_c) combination, NA's getting dropped.
>
> > aggregate(tmp_df$tmp_d,by=list(tmp_df$tmp_c),mean);
>   Group.1    x
> 1       1 1.75
> 2       2 2.00
>
> What I want in this last aggregate is, a mean for the values in tmp_d
> that correspond to the tmp_c values of NA. Similarly, perhaps there is
> a way to make the second last call to aggregate return the values of
> tmp_d for the NA values of tmp_c also.
>
> How can I achieve this?
>
> --
> -- Vivek Satsangi
> Student, Rochester, NY USA
>