[BioC] Problems with edgeR for differential expression
Gordon K Smyth
smyth at wehi.EDU.AU
Thu Oct 13 04:54:04 CEST 2011
Dear Nick,
On looking back through the svn commitments, I see that the problem with
NaN fold changes is a known problem with glmFit(), and that I fixed it
myself in the devel version of edgeR on 5 May 2011. Sorry, I had
forgotten. I was also the one who introduced the problem a bit over a
month earlier. Apparently I didn't fix the bug in the release version,
but I have done that now.
The problem with NaN fold changes arose when one or more of the groups for
a particular transcript had all zero counts. In this case, NaNs could be
introduced for all the coefficients, not just for the group with zero
counts. It affected only the fold changes, the likelihood ratio statistic
and p-values were unaffected.
Just FYI, the reason for the bug is because glmFit() is trying to be a bit
clever. glmFit() detects whenever your design matrix is equivalent to a
oneway layout. If so, it calls a special-purpose algorithm implemented in
mglmOneWay() to fit a oneway layout. The algorithm estimates the
log-expression for each group, which of course is -Inf if all the counts
are zero. The problem arises when the function converts the
log-expression values to the parametrization implied by your design
matrix. R unfortunately converts most calculations with -Inf values to
NaN. The problem has been fixed in mglmOneWay() by converting -Inf
log-expression values to -1e8.
You do need to upgrade to the devel versions of both edgeR and limma.
The recommended way to do this is to install R-devel, then use
biocLite("limma") etc at the R prompt.
The fold changes when one group has entirely zero counts are in principle
infinite. The actually values reported by edgeR or other packages are
usually set to some large value, and are therefore somewhat arbitrary in
actual value, except in that they are larger than other fold changes.
When comparing glmLRT() with exactTest(), the p-values will naturally be
most different when some of the counts are very small, and zero is the
smallest value that is possible. It is also naturally in this case that
the p-values are most discrete. This is to be expected.
Note that the p-values returned by exactTest() are now slightly different
when your groups have unequal numbers of samples. This because a
different method is now used to define the rejection region. See the help
page for a description.
Best wishes
Gordon
---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
Tel: (03) 9345 2326, Fax (03) 9347 0852,
smyth at wehi.edu.au
http://www.wehi.edu.au
http://www.statsci.org/smyth
On Tue, 11 Oct 2011, Nick Schurch wrote:
> Hi Gordon,
>
> Thanks for the quick response, and sorry I didn't get back to you
> yesterday. I'll try to address each of your questions clearly.
>
> 1)
>
>> Finally, would you be willing to share some of your date with us
>> offline so that we can trouble-shoot? As I said, we haven't seen this
>> behaviour.
>
> We'd be happy to share the anonymized data with you offline. Giving you
> the DGElist, or the dataframe would be best from our perspective. What
> would be the best way of getting it to you?
>
> 2)
>
>> Can you try updating to edgeR 2.3.52 (the current devel version) to see
>> it makes a difference? (There are many changes, but one is that
>> exactTest() is much faster now.)
>
> I have downloaded v2.3.52, but I'm having some problems building it.
>
> R CMD INSTALL edgeR_2.3.52.tar.gz
> * installing to library â/homes/nschurch/myRlibrary/x86_64â
> * installing *source* package âedgeRâ ...
> ** R
> ** data
> ** inst
> ** preparing package for lazy loading
> Error : class "MDS" is not exported by 'namespace:limma'
> ERROR: lazy loading failed for package âedgeRâ
> * removing â/homes/nschurch/myRlibrary/x86_64/edgeRâ
>
> I have limma (v3.4.4) installed and it seems to be working fine so I'm not
> quite sure what is going on here. Do I need a more up to date version of
> limma for the development build of edgeR?
>
> 3)
>
>> Given your experimental layout with multiple groups but no covariates,
>> you could use the "classic" edgeR functions:
>>
>> estimateCommonDisp(y)
>> exactTest()
>> topTags()
>>
>> rather than the glm code. This would allow you to compare any two of
>> your groups. What results does this give?
>
> This seems to give sensible results; I don't get any NaN fold-changes.
> But it seems to present some new questions. A quick check confirms that
> the fold-changes calculated by each method are tightly correlated (for
> those that aren't 'NaN'), but a plot of the p-values calculated by each
> method is more complicated.
>
> Attached are two .pngs of this plot, one for the full range of p-values,
> and one for a zoomed in region of the plot. In these figures, the blue
> points are genes with non-NaN fold-changes in both calculations and the
> red points are those with NaN fold-changes in the glm calculation. The
> solid red line marks the 1:1 line.
>
> While the p-values are correlated, there is considerable structure to
> the plot. There seems to be at least 3 distinct regions to the plot that
> are al removed from the 1:1 line. The zoomed plot also shows clear
> discretization of the p-values from the exactTest for the red points
> (and maybe also for the glm p-values), that isn't clearly seen in the
> blue points. This makes me think that while the exactTest is calculating
> a reasonable-looking fold change, that they are not to be trusted and
> that something funny is going on. We are also getting strange results
> when comparing edgeR p-values (from either method) with those calculated
> by DESeq for the same data (I can send you these plots if you're
> interested).
>
> I really think there is something funny going on here. Have you seen
> anything like these structures when comparing the p-values calculated by
> each method?
>>> I'm running edgeR on data on about 6000 genes, across 5 experimental
>>> conditions. When I compute the differential expression for any two of
>>> these conditions edgeR it is returning a 'nan' fold-change for about
>>> 1500 out of the 6000 genes being tested. Amazingly, it is also
>>> returning p-values for these fold-changes, and the p-values cover a
>>> range of values from total insignificant (>0.5) to very significant
>>> (<1E-4)! At first I thought if might be because there is sometimes no
>>> signal for a gene in a given condition, but 1) other cases of this
>>> nature produce '-inf' or 'inf' fold-changes, not 'nan' fold-hanges,
>>> and 2) in some cases edgeR is calculating a 'nan' fold-change for
>>> something that has signal in every replicate of both conditions! I've
>>> checked everything I can think of... I don't get any errors or
>>> warnings, the Norm Factors are sensible, the raw signal is sensible,
>>> all the objects look well-formed (i.e. like they contain all the bits
>>> they should contain).... its very confusing and frustrating.
>>>
>>> So, am I doing something wrong? Or is there a deeper problem with this
>>> package?
>>>
>>> I'm using R v 2.13.1, edgeR 2.2.5 and the commands I'm using are:
>>>
>>> # create the design matrix
>>>> model.groups<-groups
>>>> model.factors<-as.factor(**model.groups)
>>>> model<-model.matrix(~model.**factors)
>>>>
>>>> # build DGElist and calculate normalization factors
>>>> x=as.data.frame(data)
>>>> rownames(x)=genenames
>>>> y=DGEList(x,group=groups)
>>>> y=calcNormFactors(y)
>> Cheers,
>>
>> Nick Schurch
>>
>> Data Analysis Group (The Barton Group),
>> School of Life Sciences,
>> University of Dundee,
>> Dow St,
>> Dundee,
>> DD1 5EH,
>> Scotland,
>> UK
>>
>> Tel: +44 1382 388707
>> Fax: +44 1382 345 893
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:5}}
More information about the Bioconductor
mailing list