[BioC] how edgeR control the outliers?

Gordon K Smyth smyth at wehi.EDU.AU
Sat Jan 28 01:57:48 CET 2012


Dear Yuan,

The edgeR empirical-Bayes algorithm is actually somewhat resistant to
outliers, because it allows for gene-specific variability, unlike
algorithms that treat the variance as a function of the mean.  However, the
edgeR algorithm is designed to deal more with routine gene-specific
biological variation than with true outliers.  We prefer to detect and
remove true outliers in the data checking steps rather than accommodate
them as part of the dispersion estimation algorithm.

Let me say that the second paragraph of your email is hard to understand, 
because you cannot give either median or mean counts to edgeR.  You must give 
actual read counts.  I wonder if your problems are not caused by inputting 
inappropriate data into edgeR?

edgeR has a great many options, and it would certainly help in writing a 
response to know which ones you are using already.  For the purposes of 
this email, I am going to assume that you are doing a valid analysis using 
true read counts, and that you have used either estimateTagwiseDisp() or 
estimateGLMTagwiseDisp() with default settings.

If you do have some substantial outliers, here are some options:

1. First, filter genes before analysis as in the edgeR User's Guide case 
studies that deal with RNA-Seq data.  Suppose that you have 4 control 
libraries and 4 experimental: then keep genes only if they satisfy a 
minimum count-per-million (cpm>1 say) in at least four samples.  This 
eliminates genes with RNA-Seq artifacts such that they are zero except in 
one or two samples.

2. Plot trended and tagwise dispersion estimates against abundance to look 
for outliers.

3. Test for outliers using the gof() function.

4. Reduce the prior.n setting to a smaller value, so that the tagwise 
dispersions are squeezed less strongly towards the common or trended value.
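
As a rough illustration of the filtering rule in option 1, here is a small 
Python sketch that converts raw counts to counts-per-million and keeps a gene 
only if cpm>1 in at least four libraries.  This is not edgeR code: the 
function names, the toy count matrix and the use of column totals as library 
sizes are assumptions made for the example.  In edgeR itself the cpm() 
function does the counts-per-million calculation.

```python
def counts_per_million(counts, lib_sizes):
    """Convert raw counts (rows = genes, columns = libraries) to CPM."""
    return [[1e6 * c / n for c, n in zip(row, lib_sizes)] for row in counts]


def keep_genes(counts, min_cpm=1.0, min_samples=4):
    """Return one True/False flag per gene: True if the gene exceeds
    min_cpm in at least min_samples libraries."""
    # Assumption: library size is taken as the column total of the counts.
    lib_sizes = [sum(col) for col in zip(*counts)]
    cpm = counts_per_million(counts, lib_sizes)
    return [sum(v > min_cpm for v in row) >= min_samples for row in cpm]


# Toy data (hypothetical): 3 genes x 8 libraries, 4 control + 4 experimental.
counts = [
    [500000, 480000, 510000, 490000, 505000, 495000, 500000, 520000],
    [0, 0, 0, 0, 0, 0, 37, 0],   # zero except in a single library
    [3, 2, 4, 3, 0, 0, 0, 0],    # expressed in one group only
]

keep = keep_genes(counts)
print(keep)
```

The second gene, which is zero except in one library, is the kind of artifact 
that the filter removes, while the third gene is kept because it passes the 
threshold in all four libraries of one group.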

If none of this solves your problems, you might try the voom() function in 
the limma package instead.  (See the limma User's Guide.)  This approach 
is more flexible in adapting automatically to gene-specific variability in 
RNA-Seq data than the edgeR algorithm, and has proved successful on some 
high-variability datasets.

Best wishes
Gordon


> Date: Thu, 26 Jan 2012 21:19:55 -0800
> From: Yuan Tian <ytianidyll at ucla.edu>
> To: Bioconductor mailing list <bioconductor at r-project.org>
> Subject: [BioC] how edgeR control the outliers?
>
> Dear all,
>
> I use edgeR for differential expression analysis on an RNAseq dataset. 
> But I found that edgeR is very sensitive to outlier samples. For 
> example, for one gene, overall the expression pattern is similar between 
> control group and experimental group, but there is one single sample 
> which behaves very differently from the others, then this gene is very 
> likely to be falsely detected as differentially expressed. So can anyone 
> please tell me if there's any option in the algorithm that can control 
> the outlier impact?
>
> I'm thinking of using median read count value instead of mean read count 
> value to fit the NB distribution, and to estimate the dispersions. Just 
> wondering if there's an option available in edgeR? Or is there any other 
> RNAseq DE analysis package which is less sensitive to outliers?
>
> The outlier sample might be different when you look at different genes, 
> so we can't take the whole sample out in the analysis.
>
> Yuan
