[BioC] edgeR outlier question
whuber at embl.de
Tue May 8 16:36:45 CEST 2012
The occurrence of such outliers has nothing to do with the number
of replicates (as Alessandro also points out below; and I did not imply
that in my post). However, the potential solutions to the problem have a
lot to do with it.
1. With very many replicates, in principle you don't need to make any
parametric assumptions, and you could just use a permutation or rank
2. With an intermediate number of replicates, you could use a highly
flexible family of distributions, such as Poisson-Tweedie (PT, suggested
by Robert Castelo). For this, you need to fit 3 parameters per gene. In
the tweeDESeq vignette, they look at RNA-Seq data from 69 HapMap
3. The Negative Binomial (NB) distribution (used in edgeR and DESeq) is
a special case of PT, with 2 parameters per gene. Fewer replicates are
sufficient to determine these (e.g. sharingMode = "gene-est-only" in
4. With few replicates (say, 2 vs 2), even two parameters per gene are
too many to fit reliably. For these instances, 'information sharing'
across genes is used, either assuming a common dispersion, or a
mean-dependent one. In the extreme case, you end up with essentially 1
parameter per gene (e.g. sharingMode = "fit-only" in DESeq's
5. An intermediate solution, which can be thought of as using somewhere
between 1 and 2 parameters per gene, is the shrinkage approach used by
edgeR, or sharingMode = "maximum" in DESeq, which we find to work well.
I hope this overview helps. The current instances of software require
you to select your analysis strategy between these 5 options manually.
It would be interesting to see if that can be (or should be?) automated.
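
To make the parameter counting in options 3 to 5 concrete, here is a minimal Python sketch. It is not the code of edgeR or DESeq. It assumes a method-of-moments dispersion estimate under var = mu + alpha * mu^2, a single common dispersion in place of a fitted mean-dependent trend, and a simple linear blend as a stand-in for empirical-Bayes shrinkage. The function names and the `weight` parameter are hypothetical.

```python
from statistics import mean, variance


def moment_dispersion(counts):
    """Method-of-moments NB dispersion alpha, from var = mu + alpha * mu^2."""
    mu = mean(counts)
    if mu == 0:
        return 0.0
    # Clamp at zero: the sample variance can fall below the mean by chance.
    return max((variance(counts) - mu) / mu ** 2, 0.0)


def dispersion_estimates(genes, weight=0.5):
    """Compare dispersion strategies for replicates of one condition.

    genes  : dict mapping gene name -> list of (normalized) counts
    weight : hypothetical shrinkage weight towards the shared value (0..1)
    """
    per_gene = {g: moment_dispersion(c) for g, c in genes.items()}
    # Assumption: one common dispersion instead of a mean-dependent fit.
    common = mean(list(per_gene.values()))
    result = {}
    for g, alpha in per_gene.items():
        result[g] = {
            "gene_only": alpha,             # 2 parameters per gene
            "shared_only": common,          # essentially 1 parameter per gene
            "maximum": max(alpha, common),  # conservative choice
            "shrunk": weight * common + (1 - weight) * alpha,
        }
    return result


# Toy data: gene "b" has one extreme count in a single sample.
toy = {"a": [10, 12, 9, 11], "b": [5, 6, 4, 200]}
estimates = dispersion_estimates(toy)
for gene, est in estimates.items():
    print(gene, est)
```

In this toy example the gene with the single extreme count keeps its own large dispersion under "maximum", which makes a call of differential expression less likely than under "shared_only", where the outlier sample is not reflected in the variance estimate at all.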
PS In addition to the above, one can go back to the robust statistics
textbook, stay with simple parametric distributions, but remove or
down-weight outlier data points (not necessarily the whole gene).
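
As a rough illustration of the down-weighting idea in the PS, the following Python sketch assigns Huber-type weights to the samples of one gene. It is only a sketch under stated assumptions: it works on log2(count + 1), uses the median and the MAD as robust location and scale, and uses the conventional tuning constant k = 1.345. The function names are hypothetical and none of this is taken from edgeR or DESeq.

```python
from math import log2
from statistics import median


def huber_weights(counts, k=1.345, pseudo=1.0):
    """Huber-type weights for the samples of a single gene.

    Works on log2(count + pseudo). A sample within k robust standard
    deviations of the median gets weight 1; beyond that the weight
    decays as k / |standardized residual|.
    """
    logs = [log2(c + pseudo) for c in counts]
    center = median(logs)
    mad = median([abs(x - center) for x in logs])
    # 1.4826 makes the MAD consistent with the SD under normality;
    # the floor avoids division by zero when most values are identical.
    scale = max(1.4826 * mad, 1e-8)
    weights = []
    for x in logs:
        r = abs(x - center) / scale
        weights.append(1.0 if r <= k else k / r)
    return logs, weights


def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


# One gene, six replicates, one sample with a very high count.
counts = [5, 6, 4, 7, 5, 200]
logs, weights = huber_weights(counts)
plain = sum(logs) / len(logs)
robust = weighted_mean(logs, weights)
print(weights)
print(plain, robust)
```

Only the extreme sample is down-weighted here; the other replicates of the gene keep full weight, so the gene itself stays in the analysis.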
alessandro.guffanti at genomnia.com scripsit 05/08/2012 02:35 PM:
> Hi - actually this problem pops up even with as few as two replicates,
> and I would
> tend to attribute it to a technical feature of NGS (at least the ones in
> which there is
> an em-PCR step such as 454 and SOLiD) - which is over-amplification of a
> sequence set in a single sample. And this could be called shot noise I
> guess... I saw
> it both in SAGE and miRNA sequencing in multiple samples.
> I agree of course in principle on not throwing away genes for what
> happens sporadically in
> one sample. However, in my experience these 'read shots' always happen
> in the very grey area
> of few reads per sample, and if you reason in cpm this will be the area
> of fewer than 10 counts
> per million - I don't know if this is the same situation for you.
> So, these are genes usually located in the area where biological
> variance is well hidden below
> technical variance. I guess that these will not be your most significant
> findings and the solution of
> reasoning with edgeR in terms of cpm for the threshold selection -
> rather than read counts even
> in normalized libraries - worked nicely for my miRNAs when I went back
> to MDS plots to explore
> the situation...
> This is only my experience, though, so I would be interested to know if
> this 'read shot noise' happens also
> in areas where there are large counts.
> On 5/8/2012 3:22 AM, Simon Melov wrote:
>> Hi Alessandro,
>> I don't think this helps me, as I'm not looking to eliminate an entire
>> gene based on a single replicate. I mentioned in my original post that
>> I had applied the filtering discussed at length in the guide,
>> (allowing genes with at least one read, in a minimum of 8 samples was
>> my filtering criterion). But this doesn't address the problem of a very
>> high level of reads in a single sample. This issue of variance should
>> be incorporated into the analysis, and not result in genes being
>> listed as significant due to high levels in a single sample. This
>> sort of problem is not unusual in the genomics world, and I think the
>> microarray literature had numerous solutions to this sort of problem.
>> I'm surprised it popped up so early in my analysis, as I thought this
>> would have been "solved" by now. As a later poster alluded to, perhaps
>> it's due to a relatively "high" number of biological replicates (N=10
>> per group). This number of replicates going forward is going to be
>> commonplace as sequencing costs tumble. So some guidance as to how to
>> deal with this in edgeR would be very welcome.