[BioC] Problems with edgeR for differential expression

Thu Oct 13 04:54:04 CEST 2011

Dear Nick,

On looking back through the svn commitments, I see that the problem with 
NaN fold changes is a known problem with glmFit(), and that I fixed it 
myself in the devel version of edgeR on 5 May 2011.  Sorry, I had 
forgotten.  I was also the one who introduced the problem a bit over a 
month earlier.  Apparently I didn't fix the bug in the release version, 
but I have done that now.

The problem with NaN fold changes arose when one or more of the groups for 
a particular transcript had all zero counts.  In this case, NaNs could be 
introduced for all the coefficients, not just for the group with zero 
counts.  It affected only the fold changes, the likelihood ratio statistic 
and p-values were unaffected.

Just FYI, the reason for the bug is because glmFit() is trying to be a bit 
clever.  glmFit() detects whenever your design matrix is equivalent to a 
oneway layout.  If so, it calls a special-purpose algorithm implemented in 
mglmOneWay() to fit a oneway layout.  The algorithm estimates the 
log-expression for each group, which of course is -Inf if all the counts 
are zero.  The problem arises when the function converts the 
log-expression values to the parametrization implied by your design 
matrix.  R unfortunately converts most calculations with -Inf values to 
NaN.  The problem has been fixed in mglmOneWay() by converting -Inf 
log-expression values to -1e8.

You do need to upgrade to the devel versions of both edgeR and limma. 
The recommended way to do this is to install R-devel, then use 
biocLite("limma") etc at the R prompt.

The fold changes when one group has entirely zero counts are in principle 
infinite.  The actually values reported by edgeR or other packages are 
usually set to some large value, and are therefore somewhat arbitrary in 
actual value, except in that they are larger than other fold changes.

When comparing glmLRT() with exactTest(), the p-values will naturally be 
most different when some of the counts are very small, and zero is the 
smallest value that is possible.  It is also naturally in this case that 
the p-values are most discrete.  This is to be expected.

Note that the p-values returned by exactTest() are now slightly different 
when your groups have unequal numbers of samples.  This because a 
different method is now used to define the rejection region.  See the help 
page for a description.

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
Tel: (03) 9345 2326, Fax (03) 9347 0852,
smyth at wehi.edu.au
http://www.wehi.edu.au
http://www.statsci.org/smyth

On Tue, 11 Oct 2011, Nick Schurch wrote:

> Hi Gordon,
>
> Thanks for the quick response, and sorry I didn't get back to you 
> yesterday. I'll try to address each of your questions clearly.
>
> 1)
>
>> Finally, would you be willing to share some of your date with us 
>> offline so that we can trouble-shoot?  As I said, we haven't seen this 
>> behaviour.
>
> We'd be happy to share the anonymized data with you offline. Giving you 
> the DGElist, or the dataframe would be best from our perspective. What 
> would be the best way of getting it to you?
>
> 2)
>
>> Can you try updating to edgeR 2.3.52 (the current devel version) to see 
>> it makes a difference?  (There are many changes, but one is that 
>> exactTest() is much faster now.)
>
> I have downloaded v2.3.52, but I'm having some problems building it.
>
> R CMD INSTALL edgeR_2.3.52.tar.gz
> * installing to library â€˜/homes/nschurch/myRlibrary/x86_64â€™
> * installing *source* package â€˜edgeRâ€™ ...
> ** R
> ** data
> ** inst
> ** preparing package for lazy loading
> Error : class "MDS" is not exported by 'namespace:limma'
> ERROR: lazy loading failed for package â€˜edgeRâ€™
> * removing â€˜/homes/nschurch/myRlibrary/x86_64/edgeRâ€™
>
> I have limma (v3.4.4) installed and it seems to be working fine so I'm not
> quite sure what is going on here. Do I need a more up to date version of
> limma for the development build of edgeR?
>
> 3)
>
>> Given your experimental layout with multiple groups but no covariates, 
>> you could use the "classic" edgeR functions:
>>
>> estimateCommonDisp(y)
>> exactTest()
>> topTags()
>>
>> rather than the glm code.  This would allow you to compare any two of 
>> your groups. What results does this give?
>
> This seems to give sensible results; I don't get any NaN fold-changes. 
> But it seems to present some new questions. A quick check confirms that 
> the fold-changes calculated by each method are tightly correlated (for 
> those that aren't 'NaN'), but a plot of the p-values calculated by each 
> method is more complicated.
>
> Attached are two .pngs of this plot, one for the full range of p-values, 
> and one for a zoomed in region of the plot. In these figures, the blue 
> points are genes with non-NaN fold-changes in both calculations and the 
> red points are those with NaN fold-changes in the glm calculation. The 
> solid red line marks the 1:1 line.
>
> While the p-values are correlated, there is considerable structure to 
> the plot. There seems to be at least 3 distinct regions to the plot that 
> are al removed from the 1:1 line. The zoomed plot also shows clear 
> discretization of the p-values from the exactTest for the red points 
> (and maybe also for the glm p-values), that isn't clearly seen in the 
> blue points. This makes me think that while the exactTest is calculating 
> a reasonable-looking fold change, that they are not to be trusted and 
> that something funny is going on. We are also getting strange results 
> when comparing edgeR p-values (from either method) with those calculated 
> by DESeq for the same data (I can send you these plots if you're 
> interested).
>
> I really think there is something funny going on here. Have you seen 
> anything like these structures when comparing the p-values calculated by 
> each method?

>>> I'm running edgeR on data on about 6000 genes, across 5 experimental 
>>> conditions. When I compute the differential expression for any two of 
>>> these conditions edgeR it is returning a 'nan' fold-change for about 
>>> 1500 out of the 6000 genes being tested. Amazingly, it is also 
>>> returning p-values for these fold-changes, and the p-values cover a 
>>> range of values from total insignificant (>0.5) to very significant 
>>> (<1E-4)! At first I thought if might be because there is sometimes no 
>>> signal for a gene in a given condition, but 1) other cases of this 
>>> nature produce '-inf' or 'inf' fold-changes, not 'nan' fold-hanges, 
>>> and 2) in some cases edgeR is calculating a 'nan' fold-change for 
>>> something that has signal in every replicate of both conditions! I've 
>>> checked everything I can think of... I don't get any errors or 
>>> warnings, the Norm Factors are sensible, the raw signal is sensible, 
>>> all the objects look well-formed (i.e. like they contain all the bits 
>>> they should contain).... its very confusing and frustrating.
>>>
>>> So, am I doing something wrong? Or is there a deeper problem with this 
>>> package?
>>>
>>> I'm using R v 2.13.1, edgeR 2.2.5 and the commands I'm using are:
>>>
>>>  # create the design matrix
>>>> model.groups<-groups
>>>> model.factors<-as.factor(**model.groups)
>>>> model<-model.matrix(~model.**factors)
>>>>
>>>> # build DGElist and calculate normalization factors
>>>> x=as.data.frame(data)
>>>> rownames(x)=genenames
>>>> y=DGEList(x,group=groups)
>>>> y=calcNormFactors(y)

>> Cheers,
>>
>> Nick Schurch
>>
>> Data Analysis Group (The Barton Group),
>> School of Life Sciences,
>> University of Dundee,
>> Dow St,
>> Dundee,
>> DD1 5EH,
>> Scotland,
>> UK
>>
>> Tel: +44 1382 388707
>> Fax: +44 1382 345 893

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:5}}