[BioC] edgeR and FDR

Mon Jun 28 06:39:40 CEST 2010

Hi Naomi,

Davis has pointed out to me that I'm not quite correct.  edgeR does 
automatically filter out K<6 when estimating the common dispersion, but 
when doing the DE analysis the only automatic filter is to remove K=0.  I 
agree that a filter like you suggest is very sensible as a routine 
procedure, and I've been thinking along the same lines.  We did do this 
filtering for the 't Hoen data case study in the edgeR user's guide, but 
haven't done it so far for the other case studies.
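
A minimal sketch of such a filter, assuming two libraries of equal size so that the exact test reduces to a two-sided binomial test on the tag total K. This is an illustration only, not edgeR's internal code, and the names min_pvalue, min_total_count and counts are hypothetical:

```r
# Smallest attainable two-sided p-value for a tag with K reads in total.
# ASSUMPTION: two libraries of equal size, so that under the null the count
# in library 1 is Binomial(K, 0.5) given K.  The most extreme outcome (all K
# reads in one library) then has two-sided p-value 2 * 0.5^K.
min_pvalue <- function(K) pmin(1, 2 * 0.5^K)

# Smallest total K whose minimum attainable p-value is at or below alpha.
min_total_count <- function(alpha, K_max = 1000L) {
  K <- seq_len(K_max)
  K[which(min_pvalue(K) <= alpha)[1]]
}

# Hypothetical toy count matrix: rows are tags, columns are two libraries.
counts <- matrix(c(0, 2,
                   1, 14,
                   30, 25,
                   3, 0),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(paste0("tag", 1:4), c("A", "B")))

# Keep only tags whose total count could reach a p-value of 0.001.
K_min <- min_total_count(0.001)
keep <- rowSums(counts) >= K_min
filtered <- counts[keep, , drop = FALSE]
```

Under this assumption a cutoff of 0.001 gives K = 11, since 2 * 0.5^11 is about 0.00098.  With unequal library sizes or more than two libraries the threshold would have to be recomputed from the appropriate exact test.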

Regards
Gordon

On Mon, 28 Jun 2010, Gordon K Smyth wrote:

> Hi Naomi,
>
> edgeR already does exactly what you suggest, although we chose p=0.05 
> (leading to K=5) for this purpose rather than 0.001.  You're right that a 
> more conservative value would probably be better.  However all the NextGen 
> data sets we've analysed so far have huge amounts of DE, so it hasn't been an 
> issue.
>
> Regards
> Gordon
>
> On Sat, 26 Jun 2010, Naomi Altman wrote:
>
>> 
>> Basically, if a global FDR is used with discrete data, then one should 
>> filter low expressing genes pretty stringently.  For example, one could 
>> compute K (the marginal total for the gene) for which the smallest possible 
>> p-value is .001 (e.g. use the ordinary Fisher's exact test as an 
>> approximation) and use only features with K or more reads in the study. 
>> This improves power for the (much smaller number of) remaining features, 
>> but obviously you will then need to sort manually through the low 
>> expressing genes to determine if you have missed something striking (such 
>> as all of the K-1 reads are in a single sample).
>> 
>> --Naomi
>> 
>> 
>> 
>> At 10:39 AM 6/26/2010, you wrote:
>>> Hi Naomi,
>>> 
>>> I agree that the discreteness of the counts introduces conservatism, and 
>>> that there is a power differential between low and high expressed genes. 
>>> However the expected overall FDR is still controlled at a rate less than 
>>> or equal to the nominal rate, and that is all we promise.
>>> 
>>> To reduce the trend in DE vs expression level, I like to combine FDR with 
>>> a fold-change cutoff or, perhaps better, use a TREAT-like test.
>>> 
>>> Regards
>>> Gordon
>>> 
>>> On Sat, 26 Jun 2010, Naomi Altman wrote:
>>> 
>>>> Dear Gordon,
>>>> Thank you for your very detailed and clear answer to my question about 
>>>> the dispersion model.
>>>> 
>>>> Regarding FDR:
>>>> For discrete-valued test statistics, the distribution of the p-values 
>>>> under the null hypothesis is a discrete uniform which depends on the 
>>>> marginal total.  As a result, the distribution of p-values from the
>>>> null hypotheses is a mixture of discrete uniforms, which can be
>>>> marginally very non-uniform.  Even after filtering out low expressing
>>>> genes, it is common to see a peak of p-values near 1.0 due to this
>>>> effect.  It is less evident that there are multiple other peaks, one at
>>>> each of the discrete values of the p-value for each marginal total.  The
>>>> result of this is that FDR computations are far too conservative for
>>>> lowly expressing genes, and far too liberal for highly expressing genes,
>>>> which basically magnifies the power differential that already exists due
>>>> to the relationship between the mean and variance.
>>>> 
>>>> --Naomi
>>>> 
>>>> At 05:01 AM 6/26/2010, Gordon K Smyth wrote:
>>>>> Dear Zhe,
>>>>> To get FDR, you must use the topTags() function.  Is your de.com object 
>>>>> a deDGEList object?  If it is, then
>>>>>
>>>>>   top <- topTags(de.com, n=Inf)
>>>>>   write.table(top$table, file="yourfile.txt")
>>>>>
>>>>> will do what you want.  (I can't tell you what level of FDR to use as 
>>>>> your cutoff though, that's up to you.)
>>>>> Naomi, I don't know of any problem with FDR from edgeR.  It should work 
>>>>> just fine.
>>>>> Best wishes
>>>>> Gordon
>>>>> -----------------------------------------------
>>>>> Associate Professor Gordon K Smyth,
>>>>> NHMRC Senior Research Fellow,
>>>>> Bioinformatics Division, Walter and Eliza Hall Institute of Medical 
>>>>> Research, 1G Royal Parade, Parkville, Vic 3052, Australia.
>>>>> smyth at wehi.edu.au
>>>>> http://www.wehi.edu.au
>>>>> http://www.statsci.org/smyth
>>>>> 
>>>>> ------------ original message ---------------
>>>>> [BioC] edgeR question
>>>>> Naomi Altman naomi at stat.psu.edu
>>>>> Fri Jun 25 22:43:51 CEST 2010
>>>>> Hi Zhe,
>>>>> 1. First normalize and then do the DE
>>>>> analysis.  (I found this confusing in the vignette, too.)
>>>>> 2. I do not suggest using FDR at this time.  The
>>>>> standard FDR computations need to be adjusted for
>>>>> count data.  I do not think this has been worked out yet.
>>>>> --Naomi
>>>>> 
>>>>> At 12:21 PM 6/25/2010,  wrote:
>>>>> 
>>>>>> Hello,
>>>>>> I am learning edgeR and would like to use it
>>>>>> dealing with my Tag-seq and RNA-seq data. I have several questions:
>>>>>> 1. Does the DE analysis using common
>>>>>> dispersion or moderated tagwise dispersions use
>>>>>> the TMM method for normalization?  I am not
>>>>>> sure of the relationship between Section 6
>>>>>> (Normalization) and the following sections in
>>>>>> the user manual. I suppose I should normalize
>>>>>> the data first, and then perform DE analysis.
>>>>>> 2. Do you suggest to use P-value < 0.01? What
>>>>>> about FDR < 0.05? After saving de.tagwise (>
>>>>>> write.table(de.com[[1]], file =
>>>>>> "/Users/Zhe/edgeR/page7", sep = "\t")), I found
>>>>>> there is no FDR column. How do I
>>>>>> calculate the FDR for each gene and save it in the output file?
>>>>>> Thanks a lot.
>>>>>> Best wishes,
>>>>>> Zhe
>> 
>> Naomi S. Altman                                814-865-3791 (voice)
>> Associate Professor
>> Dept. of Statistics                              814-863-7114 (fax)
>> Penn State University                         814-865-1348 (Statistics)
>> University Park, PA 16802-2111
>> 
>> 
>
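
Naomi's point about discrete null p-values can be illustrated with a short sketch.  It assumes two libraries of equal size, so that the null distribution of a tag's count given its total K is Binomial(K, 0.5); that is a simplification for illustration rather than the test edgeR actually performs, and the function names are hypothetical:

```r
# Null distribution of the two-sided exact p-value for a tag with K reads in
# total.  ASSUMPTION: two libraries of equal size, so the count in library 1
# is Binomial(K, 0.5) under the null.
null_pvalue_dist <- function(K) {
  x <- 0:K
  prob <- dbinom(x, K, 0.5)
  # Two-sided p-value of each outcome: the total probability of all outcomes
  # that are no more likely than it (small tolerance for floating-point ties).
  pval <- vapply(prob, function(d) sum(prob[prob <= d * (1 + 1e-7)]),
                 numeric(1))
  # Collect the null probability mass sitting on each attainable p-value.
  mass <- tapply(prob, signif(pval, 12), sum)
  data.frame(pvalue = as.numeric(names(mass)), mass = as.numeric(mass))
}

# Null probability that the p-value is exactly 1.
mass_at_one <- function(d) d$mass[d$pvalue == 1]

low  <- null_pvalue_dist(3)    # a lowly expressed tag
high <- null_pvalue_dist(100)  # a highly expressed tag

print(low)
print(mass_at_one(high))
```

For K = 3 only two p-values are attainable, 0.25 and 1, and three quarters of the null mass sits at 1, which is the peak near 1.0 described in the thread.  For each attainable value p the null probability of a p-value at or below p is exactly p, so each test remains valid on its own; the non-uniformity appears when tags with different K are pooled.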
