[R] tapply() and using factor() on a factor
William Dunlap
wdunlap at tibco.com
Fri Oct 16 05:59:05 CEST 2009
> Dear List,
> Shouldn't result1 and result2 be equal in the following case?
> Note that log$RequestID is a factor. That is,
> is.factor(log$RequestID)
> yields TRUE.
> result1 <- tapply(log$Flag,factor(log$RequestID),sum)
>
> result2 <- tapply(log$Flag,log$RequestID,sum)
Showing us the output of dput(log) (or str(log) and summary(log))
would let people discover the problem more readily. Since you
didn't, I'll guess what the dataset may contain.
If log$RequestID is a factor with lots of unused levels, tapply
will output an NA for each unused level. factor(log$RequestID)
builds a new factor whose levels are only those actually used,
so tapply is not forced to fill those spots with NA's. E.g.,
> log <- data.frame(Flag=1:2, RequestID=factor(letters[1:2],
+    levels=letters[1:10]))
> tapply(log$Flag, log$RequestID, sum)
 a  b  c  d  e  f  g  h  i  j
 1  2 NA NA NA NA NA NA NA NA
> tapply(log$Flag, factor(log$RequestID), sum)
a b
1 2
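
To check whether unused levels explain the NA's in a real dataset, something
like the following might help. It is only a sketch on the toy data from the
example; the names `used`, `n_unused` and `n_na` are made up for illustration.

```r
# Toy data: a factor with 10 levels, of which only "a" and "b" occur
log <- data.frame(Flag = 1:2,
                  RequestID = factor(letters[1:2], levels = letters[1:10]))

# table() counts rows per level, including levels with no rows at all
used <- as.vector(table(log$RequestID)) > 0
n_unused <- sum(!used)             # levels with no data behind them

# tapply() on the original factor yields one cell per level
result2 <- tapply(log$Flag, log$RequestID, sum)
n_na <- sum(is.na(result2))        # NA cells in the result

# If the two counts agree, the NA's come from the unused levels
n_unused == n_na
```

If the counts match, wrapping the index in factor() (or droplevels()) before
calling tapply should remove the NA's.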
I suppose tapply(X,INDEX,FUN) could call FUN(X[0]) to see
how to fill the cells with no data behind them, but it doesn't.
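
The FUN(X[0]) idea can be applied by hand after the fact. The following is a
rough sketch, not what tapply itself does; `empty` and `empty_value` are
hypothetical names, and it assumes FUN returns a single value for
zero-length input, as sum does.

```r
# Toy data: a factor with 10 levels, of which only "a" and "b" occur
log <- data.frame(Flag = 1:2,
                  RequestID = factor(letters[1:2], levels = letters[1:10]))

res <- tapply(log$Flag, log$RequestID, sum)

# Cells with no data behind them (distinct from cells where FUN returned NA)
empty <- as.vector(table(log$RequestID)) == 0

# Ask FUN what it returns for zero-length input; for sum this is 0
empty_value <- sum(log$Flag[0])

# Fill only the empty cells with that value
res[empty] <- empty_value
res
```

Indexing by `empty` rather than by is.na(res) keeps any genuine NA that FUN
produced for a non-empty cell.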
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
> Yet, when I summarize the output, I get the following:
> summary(result1)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>   11.00   11.00   11.00   26.06   11.00  101.00
> summary(result2)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
>   11.00   11.00   11.00   26.06   11.00  101.00  978.00
> Why does result2 have 978 NA's?
> Any help on this would be appreciated.
> Alex
