[R] tapply() and using factor() on a factor
David Winsemius
dwinsemius at comcast.net
Fri Oct 16 18:02:34 CEST 2009
On Oct 16, 2009, at 11:33 AM, Alexander Peterhansl wrote:
> Thank you Mohamed and Bill for your replies. (I did not send the data
> because it is unwieldy.)
>
> Yes Bill, the issue arises directly from what you had guessed. I was
> working with a subset of the data (which implicitly had factors for
> the complete data set).
>
> On this, what is the best way to take a subset of the data which ignores
> these "extraneous" factors?
>
>> log <- data.frame(Flag=1:2,
>     RequestID=factor(letters[1:2], levels=letters[1:10]))
>> log2 <- subset(log, RequestID=="a")
>
>> levels(log2$RequestID)
> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
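
Re-apply factor() to the column after subsetting; that rebuilds the
levels from the values actually present:

    log2$RequestID <- factor(log2$RequestID)

You might think that

    log2 <- subset(log, RequestID=="a", drop=TRUE)

would do that task, but it clearly doesn't: the drop argument is passed
on to the "[" indexing step and controls dropping of dimensions, not of
unused factor levels.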
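
A minimal sketch of the fix above, using the example data from this thread. The droplevels() call is an addition that does not appear in the original exchange; it is available in more recent versions of R and should give the same result for a whole data frame. The name log3 is purely illustrative.

```r
# Example data from the thread: ten declared levels, only two in use.
log <- data.frame(Flag = 1:2,
                  RequestID = factor(letters[1:2], levels = letters[1:10]))

# subset() keeps the full set of levels on the factor column.
log2 <- subset(log, RequestID == "a")
nlevels(log2$RequestID)                    # still 10

# Re-applying factor() rebuilds the levels from the values present.
log2$RequestID <- factor(log2$RequestID)
levels(log2$RequestID)                     # "a"

# Assumed alternative (not from the thread): droplevels() removes
# unused levels from every factor column of a data frame.
log3 <- droplevels(subset(log, RequestID == "a"))
levels(log3$RequestID)                     # "a"
```

Re-applying factor() touches one column at a time, so it is the more explicit choice when only some factors should be re-levelled.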
--
DW
> In other words, how do I take a subset which yields "a" as the only
> level for log2?
>
> Alex
>
> -----Original Message-----
> From: William Dunlap [mailto:wdunlap at tibco.com]
> Sent: Thursday, October 15, 2009 11:59 PM
> To: Alexander Peterhansl; r-help at r-project.org
> Subject: RE: [R] tapply() and using factor() on a factor
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org
>> [mailto:r-help-bounces at r-project.org] On Behalf Of Alexander
>> Peterhansl
>> Sent: Thursday, October 15, 2009 2:50 PM
>> To: r-help at r-project.org
>> Subject: [R] tapply() and using factor() on a factor
>>
>> Dear List,
>> Shouldn't result1 and result2 be equal in the following case?
>>
>> Note that log$RequestID is a factor. That is,
>> is.factor(log$RequestID)
>> yields TRUE.
>>
>> result1 <- tapply(log$Flag,factor(log$RequestID),sum)
>>
>> result2 <- tapply(log$Flag,log$RequestID,sum)
>
> Showing us the output of dput(log) (or str(log) and summary(log))
> would let people discover the problem more readily. Since you
> didn't, I'll guess what the dataset may contain.
>
> If log$RequestID is a factor with lots of unused levels, tapply
> will output an NA for each unused level. factor(log$RequestID)
> will create a new set of levels, only those actually used,
> so tapply will not be forced to fill those spots with NA's. E.g.,
>
>> log <- data.frame(Flag=1:2,
>     RequestID=factor(letters[1:2], levels=letters[1:10]))
>> tapply(log$Flag, log$RequestID, sum)
>  a  b  c  d  e  f  g  h  i  j
>  1  2 NA NA NA NA NA NA NA NA
>> tapply(log$Flag, factor(log$RequestID), sum)
> a b
> 1 2
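
One way to confirm this diagnosis on a real data set is to count the unused levels directly. The sketch below uses the small example factor from this thread; the names f, unused and n_unused are illustrative. Assuming the data themselves contain no missing values, the number of unused levels should match the number of NA's that tapply() reports.

```r
# Example factor from the thread: ten declared levels, two in use.
f <- factor(letters[1:2], levels = letters[1:10])

# Levels that never occur in the data.
unused <- setdiff(levels(f), as.character(f))
length(unused)                 # 8

# Same count via table(), which reports zero for unused levels.
n_unused <- sum(table(f) == 0)
n_unused                       # 8
```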
>
> I suppose tapply(X,INDEX,FUN) could call FUN(X[0]) to see
> how to fill the cells with no data behind them, but it doesn't.
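
The idea of filling empty cells with FUN(X[0]) can be sketched by hand. This is only an illustration of the suggestion above, not something tapply() does itself; the names res, empty_value and is_empty are assumptions introduced for the example.

```r
# Example data from the thread: two used levels out of ten.
log <- data.frame(Flag = 1:2,
                  RequestID = factor(letters[1:2], levels = letters[1:10]))

res <- tapply(log$Flag, log$RequestID, sum)

# Value that FUN returns for zero observations; for sum() this is 0.
empty_value <- sum(log$Flag[0])

# Identify empty cells by their observation count, so that a genuine
# NA result in a non-empty cell would be left alone.
is_empty <- as.vector(table(log$RequestID) == 0)
res[is_empty] <- empty_value
res
```

Whether 0 is the right filler depends on FUN: it is natural for sum(), but a function such as mean() returns NaN on a zero-length vector, so the filled value should be checked case by case. Newer R releases also document a default argument to tapply() for this purpose; see ?tapply in your version.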
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
>>
>> Yet, when I summarize the output, I get the following:
>>
>> summary(result1)
>>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>   11.00   11.00   11.00   26.06   11.00  101.00
>>
>> summary(result2)
>>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
>>   11.00   11.00   11.00   26.06   11.00  101.00  978.00
>>
>> Why does result2 have 978 NA's?
>>
>> Any help on this would be appreciated.
>>
David Winsemius, MD
Heritage Laboratories
West Hartford, CT