[BioC] edgeR outlier question

Gordon K Smyth smyth at wehi.EDU.AU
Sun May 13 13:51:19 CEST 2012

I'm commenting just on point 5 below.

It is certainly true that edgeR is able to borrow information between
genes, meaning that only part of a parameter is used up for estimating the
dispersion for each gene, i.e., between 1 and 2 parameters in total per gene.

However, DESeq, as far as I understand it, uses either a global dispersion
estimate or a purely genewise estimate for each gene.  It is always one of
the extremes, never anything in between.  So it estimates either 1 or 2
parameters for each gene, not an intermediate quantity.
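
To make the distinction concrete, here is a minimal sketch in Python. It is
not the algorithm used by either package. The function names and the
prior_weight parameter are hypothetical, and the weighted average is only a
stand-in for edgeR's actual shrinkage. It assumes that each gene already has
a genewise dispersion estimate and a shared (global or fitted) value. The
first three rules return one of the two inputs for every gene, whereas the
last one can land anywhere between them.

```python
# Illustrative sketch only; not code from edgeR or DESeq.
# `genewise` holds per-gene dispersion estimates, `shared` holds the
# global or fitted value for the same genes (both are made-up inputs).

def genewise_only(genewise, shared):
    """Use the per-gene estimate as is (about 2 parameters per gene)."""
    return list(genewise)


def shared_only(genewise, shared):
    """Use the shared value as is (about 1 parameter per gene)."""
    return list(shared)


def maximum_rule(genewise, shared):
    """Pick the larger of the two for each gene.

    The result is still always one of the two inputs, never a blend.
    """
    return [max(g, s) for g, s in zip(genewise, shared)]


def shrink(genewise, shared, prior_weight):
    """Move each per-gene estimate part of the way toward the shared value.

    prior_weight is a hypothetical tuning knob in [0, 1]:
    0 reproduces genewise_only, 1 reproduces shared_only, and values in
    between correspond to "between 1 and 2 parameters per gene".
    """
    if not 0.0 <= prior_weight <= 1.0:
        raise ValueError("prior_weight must lie in [0, 1]")
    return [
        (1.0 - prior_weight) * g + prior_weight * s
        for g, s in zip(genewise, shared)
    ]


if __name__ == "__main__":
    genewise = [0.125, 0.5]   # made-up per-gene estimates
    shared = [0.25, 0.25]     # made-up shared value for the same genes
    print("genewise only:", genewise_only(genewise, shared))
    print("shared only:  ", shared_only(genewise, shared))
    print("maximum rule: ", maximum_rule(genewise, shared))
    print("shrinkage:    ", shrink(genewise, shared, prior_weight=0.5))
```

In this toy setting the maximum rule switches between the two estimates
gene by gene, while the shrinkage rule is the only one that produces an
intermediate value for a gene.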


> Date: Tue, 08 May 2012 16:36:45 +0200
> From: Wolfgang Huber <whuber at embl.de>
> To: bioconductor at r-project.org
> Subject: Re: [BioC] edgeR outlier question
> Dear Simon,
> the problem that such outliers occur has nothing to do with the number
> of replicates (as Alessandro also points out below; and I did not imply
> that in my post). However, the potential solutions to the problem have a
> lot to do with it.
> 1. With very many replicates, in principle you don't need to make any
> parametric assumptions, and you could just use a permutation or rank
> based method.
> 2. With an intermediate number of replicates, you could use a highly
> flexible family of distributions, such as Poisson-Tweedie (PT, suggested
> by Robert Castelo). For this, you need to fit 3 parameters per gene. In
> the tweeDESeq vignette, they look at RNA-Seq data from 69 HapMap
> individuals.
> 3. The Negative Binomial (NB) distribution (used in edgeR and DESeq) is
> a special case of PT, with 2 parameters per gene. Fewer replicates are
> sufficient to determine these (e.g. sharingMode = "gene-est-only" in
> DESeq's estimateDispersions).
> 4. With few replicates (say, 2 vs 2), even two parameters per gene are
> too many to fit reliably. For these instances, 'information sharing'
> across genes is used, either assuming a common dispersion, or a
> mean-dependent one. In the extreme case, you end up with essentially 1
> parameter per gene (e.g. sharingMode = "fit-only" in DESeq's
> estimateDispersions).
> 5. An intermediate solution, that can be thought of as using somewhere
> between 1 and 2 parameters per gene, is the shrinkage approach used by
> edgeR, or sharingMode = "maximum" in DESeq, which we find to work well.
> I hope this overview helps. The current instances of software require
> you to select your analysis strategy between these 5 options manually.
> It would be interesting to see if that can be (or should be?) automated.
> PS In addition to the above, one can go back to the robust statistics
> textbook, stay with simple parametric distributions, but remove or
> down-weight outlier data points (not necessarily the whole gene).
> 	Best wishes
> 	Wolfgang
