[BioC] how edgeR control outliers?

Yuan Tian ytianidyll at ucla.edu
Fri Mar 2 05:26:04 CET 2012

```
Dear Gordon,

I made the Q-Q plot following the instructions in your last email and got the attached plot. How should we interpret the results? With an adjusted p-value cutoff of 0.1, the gof() function flags no genes as outliers, but judging from the Q-Q plot, the fit does not seem very good.

Here I use tagwise dispersion values.

[Attachment: Screen shot 2012-03-01 at 8.25.38 PM.png (image/png, 28854 bytes)]
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20120301/9a6c52ea/attachment.png>

Yuan

On Mar 1, 2012, at 2:50 PM, Gordon K Smyth wrote:

> Dear Yuan,
>
> The deviance is a standard quantity in generalized linear model theory, analogous to the residual sum of squares in ANOVA.  It is usually treated as chisquare distributed, although this approximation can be rough in some cases.  See for example:
>
>  http://en.wikipedia.org/wiki/Deviance_(statistics)
>
> Yes, when I said to test for outliers using the gof() function in
>
>  https://stat.ethz.ch/pipermail/bioconductor/2012-January/043187.html
>
> I meant that outliers are those with large gof statistics.  The calculation of p-values to test for outliers is already done for you by the gof() function.
>
> Figure 2 of the following article provides some plots of gof() statistics:
>
>  http://nar.oxfordjournals.org/content/early/2012/01/28/nar.gks042
>
> The plots are made by
>
> g <- gof(fit)
> z <- zscoreGamma(g$gof.statistics,shape=g$df/2,scale=2)
> qqnorm(z)
>
> Another very useful diagnostic is to plot the tagwise dispersion against abundance.  Outliers may appear as large dispersions.  In the development version of edgeR, there is a function plotBCV() provided to do this.
>
> Best wishes
> Gordon
>
>> Date: Wed, 29 Feb 2012 20:09:06 -0800
>> From: Yuan Tian <ytianidyll at ucla.edu>
>> To: Bioconductor mailing list <bioconductor at r-project.org>
>> Subject: [BioC] how edgeR control outliers?
>>
>> Dear all,
>>
>> I'm currently using edgeR to detect differentially expressed genes in an RNA-seq dataset, and I'm also using the gof() function to test for potential outliers. I have two questions regarding the outlier detection, and would like to have your suggestions.
>>
>> 1) How is an outlier defined? Is it a gene whose deviance is larger than some threshold? How is the deviance contained in the glmfit data calculated?
>>
>> 2) The gof() function assumes that the deviance follows a chi-squared distribution. What is the statistical basis for this assumption?
>>
>> Thanks!
>>
>> Yuan
>
```
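
The Q-Q plot recipe quoted above relies on the fact that a chi-square distribution with df degrees of freedom is a gamma distribution with shape df/2 and scale 2. The following is a minimal base-R sketch of that idea, not the edgeR implementation: it simulates Poisson counts (an assumption chosen so that no packages are needed), computes a residual deviance per gene, converts the deviances to normal z-scores, and flags outliers from adjusted upper-tail p-values. The names `n_genes`, `n_samples`, `group`, and `counts` are hypothetical, and the Holm adjustment with a 0.1 cutoff is chosen here for illustration.

```r
# Sketch only: simulated Poisson data, base R, no edgeR or limma required.
set.seed(1)
n_genes   <- 500                                # hypothetical number of genes
n_samples <- 6                                  # hypothetical number of libraries
group     <- factor(rep(c("A", "B"), each = 3)) # two groups of three samples
design    <- model.matrix(~ group)

# Counts with no true outliers and no differential expression
counts <- matrix(rpois(n_genes * n_samples, lambda = 50), nrow = n_genes)

# Residual deviance of a Poisson GLM fitted to each gene (row)
dev <- apply(counts, 1, function(y) {
  deviance(glm(y ~ group, family = poisson()))
})

# Residual degrees of freedom: samples minus fitted coefficients
df <- n_samples - ncol(design)

# Treat the deviance as chi-square(df) and map it to a standard normal z-score.
# The gamma form (shape = df/2, scale = 2) is the same distribution.
z       <- qnorm(pchisq(dev, df = df))
z_gamma <- qnorm(pgamma(dev, shape = df / 2, scale = 2))

# Upper-tail p-values; large deviances give small p-values
p       <- pchisq(dev, df = df, lower.tail = FALSE)
outlier <- p.adjust(p, method = "holm") < 0.1   # assumed adjustment and cutoff

# If the chi-square approximation holds, the points lie near the line
qqnorm(z)
qqline(z)
```

Points that curve away from the line across the whole range suggest that the chi-square approximation or the dispersion estimates fit poorly overall, whereas a few isolated points in the upper right correspond to individual genes with unusually large deviances. Because a multiplicity adjustment is conservative, a plot can look imperfect even when no single gene passes the outlier cutoff.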