[R] Remark on tapply().

Rolf Turner r.turner at auckland.ac.nz
Tue Dec 1 02:10:17 CET 2009


Consider the following:

 > set.seed(42)
 > ff <- factor(sample(c(1,3,5),42,TRUE),levels=1:5)
 > x <- runif(42)
 > tapply(x,ff,sum)
        1        2        3        4        5
3.675436       NA 7.519675       NA 9.094210

I got bitten by those NAs in the result of tapply().  Effectively
one is summing over the empty set, and consequently (according to what
I learned as a child) I thought that the result would be 0.

And that's what one gets if one does the sum ``by hand'':

 > sum(x[ff==1])
[1] 3.675436
 > sum(x[ff==2])
[1] 0
 > sum(x[ff==4])
[1] 0

On reflection I realized that since tapply() needs to work with arbitrary
functions, and since there is no way to determine what an arbitrary function
will do to the empty set, this is the Way It's Got to Be.
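
To make that concrete, here is a small sketch of what a few base R summary
functions return for a zero-length vector. The choice of functions and the
name `empty` are mine, purely for illustration; the point is only that there
is no single "empty" answer that tapply() could safely fill in.

```r
empty <- numeric(0)           # stands in for the data of an empty cell

sum(empty)                    # 0, the additive identity
prod(empty)                   # 1, the multiplicative identity
length(empty)                 # 0
mean(empty)                   # NaN
suppressWarnings(max(empty))  # -Inf (max() also warns on empty input)
```

So 0 is the right filler for sum(), but it would be wrong for prod() or
mean(), which is presumably why tapply() falls back on NA.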

But it's a trap for young players, and so I thought I'd post my experience
as a warning to others to be careful about this.

To work around the problem one ***could*** do something like

 > result <- tapply(x,ff,sum)
 > result[is.na(result)] <- 0

but that's very infra dig in my book.  I figured out something I like
much better:

	sapply(tapply(x,ff,I,simplify=FALSE),sum)

That simplify=FALSE is needed in case there is at most one entry of x for
each level of ff.  In that case every non-empty cell has length 1, so
tapply() simplifies the result to an array with NAs in the empty cells,
rather than returning a list with NULL entries for the empty cells, unless
simplify=FALSE is specified.
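
Here is a minimal sketch of that behaviour on a small fixed data set, so the
numbers do not depend on the random number generator. The names x1, ff1,
cells, totals, simplified and by_split are made up for the illustration. The
last line uses split(), which is a different route from the one above: it
keeps empty levels as zero-length vectors, so summing over them also gives 0.

```r
x  <- c(0.5, 1.5, 2.0, 4.0)
ff <- factor(c(1, 1, 3, 5), levels = 1:5)

# Plain tapply(): the empty levels 2 and 4 come back as NA.
tapply(x, ff, sum)

# Collect each cell first, then sum; empty cells are NULL and sum(NULL) is 0.
cells  <- tapply(x, ff, I, simplify = FALSE)
totals <- sapply(cells, sum)            # 2 0 2 0 4

# At most one entry per level: without simplify = FALSE the cells are
# simplified to a numeric array, and the empty levels become NA again.
x1  <- c(0.5, 2.0, 4.0)
ff1 <- factor(c(1, 3, 5), levels = 1:5)
simplified <- tapply(x1, ff1, I)        # 0.5 NA 2 NA 4, not a list

# Alternative (not the approach above): split() keeps the empty levels.
by_split <- sapply(split(x, ff), sum)   # 2 0 2 0 4
```

Either way the summing function sees a genuinely empty input for levels 2
and 4, so nothing has to be patched up afterwards.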

	cheers,

		Rolf Turner
