[R] tapply() and using factor() on a factor

Fri Oct 16 18:02:34 CEST 2009

On Oct 16, 2009, at 11:33 AM, Alexander Peterhansl wrote:

> Thank you Mohamed and Bill for your replies.  (I did not send the data
> because it is unwieldy.)
>
> Yes Bill, the issue arises directly from what you had guessed.  I was
> working with a subset of the data (which implicitly carried the factor
> levels of the complete data set).
>
> On this, what is the best way to take a subset of the data which ignores
> these "extraneous" levels?
>
>> log <- data.frame(Flag=1:2, RequestID=factor(letters[1:2], levels=letters[1:10]))
>> log2 <-subset(log, RequestID=="a")
>
>> levels(log2$RequestID)
> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

log2$RequestID <- factor(log2$RequestID)

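Calling factor() on an existing factor rebuilds it with only the levels
that actually occur.  You might think that
log2 <- subset(log, RequestID=="a", drop=TRUE) would do the same, but it
doesn't: subset() passes drop on to the "[" indexing operator, where it
controls whether dimensions are dropped, not factor levels.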
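
For reference, a minimal sketch of the options, using the toy data from
the thread.  The data frame is named 'logs' here (an assumption made for
this sketch only) so that it does not mask base::log().  droplevels() was
added to R after this thread was written (R 2.12.0), so treat that part
as applying to later versions only.

```r
# Toy data from the thread: two rows, but ten declared levels.
logs <- data.frame(Flag = 1:2,
                   RequestID = factor(letters[1:2], levels = letters[1:10]))

# subset() keeps every declared level, with or without drop = TRUE.
logs2 <- subset(logs, RequestID == "a")
logs3 <- subset(logs, RequestID == "a", drop = TRUE)
nlevels(logs2$RequestID)
nlevels(logs3$RequestID)

# Option 1: rebuild the factor so that only the used levels remain.
logs2$RequestID <- factor(logs2$RequestID)
levels(logs2$RequestID)

# Option 2: subsetting a factor vector directly accepts drop = TRUE.
ids <- logs$RequestID[logs$RequestID == "a", drop = TRUE]
levels(ids)

# Option 3 (later R versions): droplevels() on the whole data frame.
logs4 <- droplevels(subset(logs, RequestID == "a"))
levels(logs4$RequestID)
```

Option 1 touches one column at a time; Option 3 re-levels every factor
column of the data frame, which may or may not be what you want.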

-- 
DW

> In other words, how do I take a subset which yields "a" as the only
> level for log2?
>
> Alex
>
> -----Original Message-----
> From: William Dunlap [mailto:wdunlap at tibco.com]
> Sent: Thursday, October 15, 2009 11:59 PM
> To: Alexander Peterhansl; r-help at r-project.org
> Subject: RE: [R] tapply() and using factor() on a factor
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org
>> [mailto:r-help-bounces at r-project.org] On Behalf Of Alexander
>> Peterhansl
>> Sent: Thursday, October 15, 2009 2:50 PM
>> To: r-help at r-project.org
>> Subject: [R] tapply() and using factor() on a factor
>>
>> Dear List,
>> Shouldn't result1 and result2 be equal in the following case?
>>
>> Note that log$RequestID is a factor.  That is,
>> is.factor(log$RequestID)
>> yields TRUE.
>>
>> result1 <- tapply(log$Flag,factor(log$RequestID),sum)
>>
>> result2 <- tapply(log$Flag,log$RequestID,sum)
>
> Showing us the output of dput(log) (or str(log) and summary(log))
> would let people discover the problem more readily.  Since you
> didn't, I'll guess what the dataset may contain.
>
> If log$RequestID is a factor with lots of unused levels, tapply
> will output an NA for each unused level.  factor(log$RequestID)
> will create a new set of levels, only those actually used,
> so tapply will not be forced to fill those spots with NA's.  E.g.,
>
>> log <- data.frame(Flag=1:2, RequestID=factor(letters[1:2], levels=letters[1:10]))
>> tapply(log$Flag, log$RequestID, sum)
> a  b  c  d  e  f  g  h  i  j
> 1  2 NA NA NA NA NA NA NA NA
>> tapply(log$Flag, factor(log$RequestID), sum)
> a b
> 1 2
>
> I suppose tapply(X,INDEX,FUN) could call FUN(X[0]) to see
> how to fill the cells with no data behind them, but it doesn't.
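
The idea in the paragraph above can be approximated by hand.  This is
only a sketch under stated assumptions: the data frame is named 'logs'
(not the thread's 'log') to avoid masking base::log(), and 'empty_value'
is a name introduced here for illustration.  It is not how tapply()
itself behaves.

```r
# Toy data from the thread: two rows, but ten declared levels.
logs <- data.frame(Flag = 1:2,
                   RequestID = factor(letters[1:2], levels = letters[1:10]))

# One cell per declared level; unused levels come back as NA.
res <- tapply(logs$Flag, logs$RequestID, sum)

# Ask FUN what it returns for zero-length input, i.e. FUN(X[0]).
empty_value <- sum(logs$Flag[0])

# Fill the NA cells with that value.
# Caveat: this also overwrites cells where FUN genuinely returned NA.
res[is.na(res)] <- empty_value
res
```

Later R versions also added a 'default' argument to tapply() for
choosing the fill value of empty cells; check ?tapply in your version
before relying on it.
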
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
>>
>> Yet, when I summarize the output, I get the following:
>>
>> summary(result1)
>>
>>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>   11.00   11.00   11.00   26.06   11.00  101.00
>>
>> summary(result2)
>>
>>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
>>   11.00   11.00   11.00   26.06   11.00  101.00  978.00
>>
>> Why does result2 have 978 NA's?
>>
>> Any help on this would be appreciated.
>>

David Winsemius, MD
Heritage Laboratories
West Hartford, CT