[BioC] Problems with edgeR for differential expression
Gordon K Smyth
smyth at wehi.EDU.AU
Sat Oct 15 07:48:43 CEST 2011
I committed the fix to the release version on 13 October, but I recommend
you upgrade to the devel branch anyway.
Best wishes
Gordon
On Fri, 14 Oct 2011, Nick Schurch wrote:
> Thanks for rooting out the source of this problem Gordon! If I understand
> you correctly, the fixed version is only avvailable in the devel brancha t
> the moment correct?
>
> If so I'll see about upgradint to that and see if that version of edgeR
> gives me more sensible results.
>
> Nick
>
> On 13 October 2011 03:54, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>
>> Dear Nick,
>>
>> On looking back through the svn commitments, I see that the problem
>> with NaN fold changes is a known problem with glmFit(), and that I
>> fixed it myself in the devel version of edgeR on 5 May 2011. Sorry, I
>> had forgotten. I was also the one who introduced the problem a bit
>> over a month earlier. Apparently I didn't fix the bug in the release
>> version, but I have done that now.
>>
>> The problem with NaN fold changes arose when one or more of the groups
>> for a particular transcript had all zero counts. In this case, NaNs
>> could be introduced for all the coefficients, not just for the group
>> with zero counts. It affected only the fold changes, the likelihood
>> ratio statistic and p-values were unaffected.
>>
>> Just FYI, the reason for the bug is because glmFit() is trying to be a
>> bit clever. glmFit() detects whenever your design matrix is equivalent
>> to a oneway layout. If so, it calls a special-purpose algorithm
>> implemented in mglmOneWay() to fit a oneway layout. The algorithm
>> estimates the log-expression for each group, which of course is -Inf if
>> all the counts are zero. The problem arises when the function converts
>> the log-expression values to the parametrization implied by your design
>> matrix. R unfortunately converts most calculations with -Inf values to
>> NaN. The problem has been fixed in mglmOneWay() by converting -Inf
>> log-expression values to -1e8.
>>
>> You do need to upgrade to the devel versions of both edgeR and limma.
>> The recommended way to do this is to install R-devel, then use
>> biocLite("limma") etc at the R prompt.
>>
>> The fold changes when one group has entirely zero counts are in
>> principle infinite. The actually values reported by edgeR or other
>> packages are usually set to some large value, and are therefore
>> somewhat arbitrary in actual value, except in that they are larger than
>> other fold changes.
>>
>> When comparing glmLRT() with exactTest(), the p-values will naturally
>> be most different when some of the counts are very small, and zero is
>> the smallest value that is possible. It is also naturally in this case
>> that the p-values are most discrete. This is to be expected.
>>
>> Note that the p-values returned by exactTest() are now slightly
>> different when your groups have unequal numbers of samples. This
>> because a different method is now used to define the rejection region.
>> See the help page for a description.
>>
>> Best wishes
>> Gordon
>>
>> ------------------------------**---------------
>> Professor Gordon K Smyth,
>> Bioinformatics Division,
>> Walter and Eliza Hall Institute of Medical Research,
>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>> Tel: (03) 9345 2326, Fax (03) 9347 0852,
>> smyth at wehi.edu.au
>> http://www.wehi.edu.au
>> http://www.statsci.org/smyth
>>
>>
>> On Tue, 11 Oct 2011, Nick Schurch wrote:
>>
>> Hi Gordon,
>>>
>>> Thanks for the quick response, and sorry I didn't get back to you
>>> yesterday. I'll try to address each of your questions clearly.
>>>
>>> 1)
>>>
>>> Finally, would you be willing to share some of your date with us offline
>>>> so that we can trouble-shoot? As I said, we haven't seen this behaviour.
>>>>
>>>
>>> We'd be happy to share the anonymized data with you offline. Giving you
>>> the DGElist, or the dataframe would be best from our perspective. What would
>>> be the best way of getting it to you?
>>>
>>> 2)
>>>
>>> Can you try updating to edgeR 2.3.52 (the current devel version) to see
>>>> it makes a difference? (There are many changes, but one is that exactTest()
>>>> is much faster now.)
>>>>
>>>
>>> I have downloaded v2.3.52, but I'm having some problems building it.
>>>
>>> R CMD INSTALL edgeR_2.3.52.tar.gz
>>> * installing to library â/homes/nschurch/myRlibrary/**x86_64â
>>> * installing *source* package âedgeRâ ...
>>> ** R
>>> ** data
>>> ** inst
>>> ** preparing package for lazy loading
>>> Error : class "MDS" is not exported by 'namespace:limma'
>>> ERROR: lazy loading failed for package âedgeRâ
>>> * removing â/homes/nschurch/myRlibrary/**x86_64/edgeRâ
>>>
>>> I have limma (v3.4.4) installed and it seems to be working fine so I'm not
>>> quite sure what is going on here. Do I need a more up to date version of
>>> limma for the development build of edgeR?
>>>
>>> 3)
>>>
>>> Given your experimental layout with multiple groups but no covariates,
>>>> you could use the "classic" edgeR functions:
>>>>
>>>> estimateCommonDisp(y)
>>>> exactTest()
>>>> topTags()
>>>>
>>>> rather than the glm code. This would allow you to compare any two of
>>>> your groups. What results does this give?
>>>>
>>>
>>> This seems to give sensible results; I don't get any NaN fold-changes. But
>>> it seems to present some new questions. A quick check confirms that the
>>> fold-changes calculated by each method are tightly correlated (for those
>>> that aren't 'NaN'), but a plot of the p-values calculated by each method is
>>> more complicated.
>>>
>>> Attached are two .pngs of this plot, one for the full range of p-values,
>>> and one for a zoomed in region of the plot. In these figures, the blue
>>> points are genes with non-NaN fold-changes in both calculations and the red
>>> points are those with NaN fold-changes in the glm calculation. The solid red
>>> line marks the 1:1 line.
>>>
>>> While the p-values are correlated, there is considerable structure to the
>>> plot. There seems to be at least 3 distinct regions to the plot that are al
>>> removed from the 1:1 line. The zoomed plot also shows clear discretization
>>> of the p-values from the exactTest for the red points (and maybe also for
>>> the glm p-values), that isn't clearly seen in the blue points. This makes me
>>> think that while the exactTest is calculating a reasonable-looking fold
>>> change, that they are not to be trusted and that something funny is going
>>> on. We are also getting strange results when comparing edgeR p-values (from
>>> either method) with those calculated by DESeq for the same data (I can send
>>> you these plots if you're interested).
>>>
>>> I really think there is something funny going on here. Have you seen
>>> anything like these structures when comparing the p-values calculated by
>>> each method?
>>>
>>
>>
>> I'm running edgeR on data on about 6000 genes, across 5 experimental
>>>>> conditions. When I compute the differential expression for any two of these
>>>>> conditions edgeR it is returning a 'nan' fold-change for about 1500 out of
>>>>> the 6000 genes being tested. Amazingly, it is also returning p-values for
>>>>> these fold-changes, and the p-values cover a range of values from total
>>>>> insignificant (>0.5) to very significant (<1E-4)! At first I thought if
>>>>> might be because there is sometimes no signal for a gene in a given
>>>>> condition, but 1) other cases of this nature produce '-inf' or 'inf'
>>>>> fold-changes, not 'nan' fold-hanges, and 2) in some cases edgeR is
>>>>> calculating a 'nan' fold-change for something that has signal in every
>>>>> replicate of both conditions! I've checked everything I can think of... I
>>>>> don't get any errors or warnings, the Norm Factors are sensible, the raw
>>>>> signal is sensible, all the objects look well-formed (i.e. like they contain
>>>>> all the bits they should contain).... its very confusing and frustrating.
>>>>>
>>>>> So, am I doing something wrong? Or is there a deeper problem with this
>>>>> package?
>>>>>
>>>>> I'm using R v 2.13.1, edgeR 2.2.5 and the commands I'm using are:
>>>>>
>>>>> # create the design matrix
>>>>>
>>>>>> model.groups<-groups
>>>>>> model.factors<-as.factor(****model.groups)
>>>>>> model<-model.matrix(~model.****factors)
>>>>>>
>>>>>>
>>>>>> # build DGElist and calculate normalization factors
>>>>>> x=as.data.frame(data)
>>>>>> rownames(x)=genenames
>>>>>> y=DGEList(x,group=groups)
>>>>>> y=calcNormFactors(y)
>>>>>>
>>>>>
>> Cheers,
>>>>
>>>> Nick Schurch
>>>>
>>>> Data Analysis Group (The Barton Group),
>>>> School of Life Sciences,
>>>> University of Dundee,
>>>> Dow St,
>>>> Dundee,
>>>> DD1 5EH,
>>>> Scotland,
>>>> UK
>>>>
>>>> Tel: +44 1382 388707
>>>> Fax: +44 1382 345 893
>>>>
>>>
>> ______________________________**______________________________**__________
>> The information in this email is confidential and intended solely for the
>> addressee.
>> You must not disclose, forward, print or use it without the permission of
>> the sender.
>> ______________________________**______________________________**__________
>>
>
>
>
> --
> Cheers,
>
> Nick Schurch
>
> Data Analysis Group (The Barton Group),
> School of Life Sciences,
> University of Dundee,
> Dow St,
> Dundee,
> DD1 5EH,
> Scotland,
> UK
>
> Tel: +44 1382 388707
> Fax: +44 1382 345 893
>
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:5}}
More information about the Bioconductor
mailing list