[BioC] strange results with edgeR::goodTuring
Gordon K Smyth
smyth at wehi.EDU.AU
Thu Sep 6 09:26:45 CEST 2012
Dear Francois,
Thanks to new C code from Aaron Lun, I have now committed a fixed version
of goodTuring() to edgeR on the BioC devel repository.
Best wishes
Gordon
---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.statsci.org/smyth
On Tue, 28 Aug 2012, Francois Pepin wrote:
> Thanks for checking it out.
>
> I'll see if I can find the bug or just work around it.
>
> I'm not actually using edgeR in this case other than this function. I
> was only looking for an existing implementation of the Good-Turing
> method and found it in edgeR.
>
> François
>
> On Aug 28, 2012, at 3:32 , Gordon K Smyth wrote:
>
>> Hi Francois,
>>
>> Well, looks like your data example has exposed a bug in my R
>> implementation of the Good-Turing algorithm. I just ran the original C
>> code for the algorithm on your data, and it gives the following output:
>>
>> 0 0
>> 312 0.0001363
>> 14491 0.006316
>> 16401 0.007149
>> 65124 0.02839
>> 129797 0.05657
>> 323321 0.1409
>> 366051 0.1595
>> 368599 0.1607
>> 405261 0.1766
>> 604962 0.2637
>>
>> I'll have to think about what to do about this. I don't really have time
>> to track down the bug. We could bring C code into edgeR instead, but the
>> original C code would need some porting. The R code gives identical
>> results to the C for longer vectors with a more typical pile-up of
>> frequencies.
>>
>> I wonder what you mean when you say you want to estimate what kind of
>> pseudo counts to use. In edgeR terminology, the pseudo counts are
>> computed internally, and the user doesn't get to choose them.
>>
>> Best wishes
>> Gordon
>>
>>> Date: Mon, 27 Aug 2012 12:00:19 -0700
>>> From: "Francois Pepin" <francois.pepin at sequentainc.com>
>>> To: "bioconductor at r-project.org" <bioconductor at r-project.org>
>>> Subject: [BioC] strange results with edgeR::goodTuring
>>>
>>> Hi everyone,
>>>
>>> I'm trying to use the goodTuring function in edgeR to estimate what kind
>>> of pseudocounts to use and I'm getting strange results with a small number
>>> of categories:
>>>
>>> x<-c(312,14491,16401,65124,129797,323321,366051,368599,405261,604962)
>>> y<- goodTuring(x)
>>> y
>>> $count
>>> [1] 312 14491 16401 65124 129797 323321 366051 368599 405261 604962
>>>
>>> $proportion
>>> [1] 0 0 0 0 0 0 0 0 0 1
>>>
>>> $P0
>>> [1] 0
>>>
>>> $n0
>>> [1] 0
>>>
>>>
>>> If I'm understanding this properly, y$proportion is telling me that I
>>> should expect all my counts to fall under the last category, which does
>>> not make sense. I would expect something pretty close to x/sum(x)
>>> instead.
>>>
>>> This is a bit of a toy example and I'm mostly interested in cases where
>>> I have more categories but it would be nice if this could work in all
>>> cases.
>>>
>>> sessionInfo()
>>> R version 2.15.1 (2012-06-22)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>> [7] LC_PAPER=C LC_NAME=C
>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] edgeR_2.6.9 limma_3.12.1 dataframe_2.5
>>>
>>>
>>> Thanks,
>>>
>>> François
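
The expectation stated in the original post, that the proportions should be close to x/sum(x), can be checked against the C output quoted in Gordon's reply. The following is only a rough sketch in base R, not part of edgeR: the names c_prop, naive and max_rel_diff are made up for illustration, the proportions are copied by hand from the reply, and the leading "0 0" row of that output is left out on the assumption that it does not correspond to any of the ten counts.

```r
# Counts from the original post.
x <- c(312, 14491, 16401, 65124, 129797, 323321, 366051, 368599, 405261, 604962)

# Proportions printed by the original C code, copied from Gordon's reply
# (assumption: the leading "0 0" row is not one of the ten counts).
c_prop <- c(0.0001363, 0.006316, 0.007149, 0.02839, 0.05657,
            0.1409, 0.1595, 0.1607, 0.1766, 0.2637)

# Naive relative frequencies, the baseline the original post expected.
naive <- x / sum(x)

# Largest relative difference between the C output and the naive baseline.
max_rel_diff <- max(abs(c_prop - naive) / naive)

# Side-by-side view, rounded to four significant digits.
print(signif(cbind(count = x, naive = naive, c_code = c_prop), 4))
print(max_rel_diff < 0.01)
```

For these ten counts the C output stays within about one percent of x/sum(x), which is in line with what the original post expected and unlike the 0 0 0 ... 1 vector returned by the R implementation at the time.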