[Bioc-sig-seq] RNASeq, differential expression between group, and large variance within groups

Mon Feb 21 20:34:00 CET 2011

Dear Laurant

On 02/21/2011 03:36 PM, Laurent Gautier wrote:
> We are looking at tag-based RNASeq data, and after running popular 
> packages for finding differential expression (edgeR, and DEGseq) we were

> looking that the actual counts for the significant ones.
> 
> We are observing a somewhat extreme variance within each group for those

> (say one sample with high count for gene X while others have zero
count).
> 
> For example, gene X flagged as differentially expressed has the 
> following counts (adjusted p-value with DGESeq is 9.401479e-10):
> 0 grp_A
> 0 grp_A
> 0 grp_A
> 92207 grp_B
> 0 grp_B
> 0 grp_B
> 
> The underlying binomial is obviously not like the almost-Gaussian 
> assumed in microarrays/t-test-like approaches, but this kind of outcome 
> is somehow intriguing me. Do people here have experience to share 
> regarding how well such gene hold through the qPCR verification step ?

I have seen such genes as well in my data sets, and I am in fact worried
that DESeq does not do a too great job handling them. The vignette talks
about this issue, which I term "variance outliers", but maybe does not put
enough emphasis on it. I'm currently working on improvements.

The origin of the issue is this: DESeq is based on the assumption that
genes of similar expression strength have roughly the same variance. If one
has only few replicates, this is the best one can do, because even if all
genes of a certain expression strength have variances differing by factors
of 10 or so, we would not be able to see that from the per-gene estimate of
variance that we use, because the chi-square distribution for a variance
estimated from just three replicates (as in your case) is so wide that such
discrepancies are to be expected. However, there are occasionally genes
with such a large sample variance that the basic assumption that this gene
has the same true variance as its neighbours becomes implausible. I
suggest, for the moment, to exclude these genes.

To help you flag them, the results data frame returned by 'nbinom.test'
contains the columns 'resVarA' and 'resVarB', which give the "variance
residuals", i.e., the ratio of per-gene variance over fitted variance.
Accept as hits only those genes with an adjusted p value below your chosen
FDR threshold _and_ a residual variance below some threshold.

How to chose this threshold? I am not quite sure about this, to be honest,
and this is why the vignette is a bit vague on the issue. For now, I
suggest to have a look at the plot generated by

  plot( pmax( res$resVarA, res$resVarB ), res$pval )

You will notice that to the left of the plot, the p values spread over the
whole range from 0 to 1, as you would expect, but to the very right, there
are a few genes that all have low p value,. This is the regime where type-I
error control breaks down. I suggest to judge by eye where this starts to
be the case and discount the genes further to the right.

In most data sets these are only very few genes, but still, it is not a
fully satisfactory state of affair. I recently tested how edgeR deals with
the issue and found that it does not do a much better job in handling such
genes unless you have a large number of replicates. 

Certainly, we need a better approach and I am currently testing some
ideas, and may have something soon. For now, just cut out the variance
outliers.

Cheers
  Simon

+---
| Dr. Simon Anders, Dipl.-Phys.
| European Molecular Biology Laboratory (EMBL), Heidelberg
| office phone +49-6221-387-8632
| preferred (permanent) e-mail: sanders at fs.tum.de