[BioC] how edgeR control outliers?

Sun Mar 4 00:46:00 CET 2012

Dear Yuan,

Data analysis decisions are not made on the basis of one picture, and I 
have not seen your other plots.  However, the qqnorm plot suggests to me 
that you do not actually have outliers, because there are no individual 
points that stand out.  Rather you have an extraordinarily large degree of 
diversity in the tagwise dispersions, as evidenced by a large number of 
qqnorm points above the line in the upper half of the plot.  From an edgeR 
point of view, I would suggest using a smaller value for prior.n.  From a 
biological point of view, I would wonder whether the two groups you are 
comparing are truly homogeneous.  I would wonder whether the tagwise 
dispersions are reflecting differential expression within groups.

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
smyth at wehi.edu.au
http://www.wehi.edu.au
http://www.statsci.org/smyth

On Thu, 1 Mar 2012, Yuan Tian wrote:

Dear Gordon,

I made the qqplot following the instructions in your last email, and the 
resulting plot is attached. How should we interpret the results? According 
to the gof() function with an adjusted p-value cutoff of 0.1, no genes are 
detected as outliers, but according to the qqplot the fit does not seem to 
be very good.

Here I use tagwise dispersion values.

[Attachment: Screen shot 2012-03-01 at 8.25.38 PM.png (image/png, 28854 bytes)]
<https://stat.ethz.ch/pipermail/bioconductor/attachments/20120301/9a6c52ea/attachment.png>

Yuan

On Mar 1, 2012, at 2:50 PM, Gordon K Smyth wrote:

> Dear Yuan,
>
> The deviance is a standard quantity in generalized linear model theory, 
> analogous to the residual sum of squares in ANOVA.  It is usually treated 
> as chi-square distributed, although this approximation can be rough in 
> some cases.  See for example:
>
>  http://en.wikipedia.org/wiki/Deviance_(statistics)
>
> Yes, when I said to test for outliers using the gof() function in
>
>  https://stat.ethz.ch/pipermail/bioconductor/2012-January/043187.html
>
> I meant that outliers are those with large gof statistics.  The 
> calculation of p-values to test for outliers is already done for you by 
> the gof() function.
>
> Figure 2 of the following article provides some plots of gof() 
> statistics:
>
>  http://nar.oxfordjournals.org/content/early/2012/01/28/nar.gks042
>
> The plots are made by
>
> g <- gof(fit)
> z <- zscoreGamma(g$gof.statistics,shape=g$df/2,scale=2)
> qqnorm(z)
>
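As a rough illustration of what the transformation above computes, the sketch below uses base R only. It assumes that zscoreGamma() amounts to mapping each statistic to the normal quantile with the same tail probability, and it uses simulated chi-square values in place of real gof() output. The names gof.stat and df and the numbers 1000 and 4 are hypothetical stand-ins, not values from this thread.

```r
# Simulated stand-in for g$gof.statistics: deviances for 1000 tags,
# each assumed chi-square on 4 residual df (both numbers are hypothetical).
set.seed(1)
df <- 4
gof.stat <- rchisq(1000, df = df)

# A chi-square on df degrees of freedom is a gamma with shape df/2 and scale 2.
# Map each statistic to the normal quantile with the same lower-tail probability.
p <- pgamma(gof.stat, shape = df / 2, scale = 2)
z <- qnorm(p)

# If the chi-square approximation holds, the points lie close to the line.
qqnorm(z)
qqline(z)
```

With real data, many points sitting above the line in the upper half of the plot would indicate deviances larger than the chi-square approximation predicts.
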
> Another very useful diagnostic is to plot the tagwise dispersion against 
> abundance.  Outliers may appear as large dispersions.  In the 
> development version of edgeR, there is a function plotBCV() provided to 
> do this.
>
> Best wishes
> Gordon
>
>> Date: Wed, 29 Feb 2012 20:09:06 -0800
>> From: Yuan Tian <ytianidyll at ucla.edu>
>> To: Bioconductor mailing list <bioconductor at r-project.org>
>> Subject: [BioC] how edgeR control outliers?
>>
>> Dear all,
>>
>> I'm currently using edgeR to detect differentially expressed genes 
>> from an RNA-seq dataset, and I'm also using the gof() function to test 
>> for potential outliers. I have two questions regarding the outlier 
>> detection, and would like to have your suggestions.
>>
>> 1) How is an outlier defined? Is it a gene that has a deviance larger 
>> than a threshold? How is the deviance contained in the glmfit data 
>> calculated?
>>
>> 2) The gof() function assumes that the deviance follows a chi-squared 
>> distribution. What is the statistical basis for this assumption?
>>
>> Thanks!
>>
>> Yuan
