[BioC] edgeR and sagenhaft

Sat Feb 14 01:35:51 CET 2009

Hi Naomi.

Curious.  A bit difficult to diagnose without digging into it.  There is
probably a reasonable explanation for all of this.

For what it's worth, a few comments/queries below.

> I have 4 large tag datasets  A1, A2 and B1, B2.  The purpose of the
> experiment was to determine differences in gene expression between A and
> B.
> A1 and B1 were done together as batch 1, and  A2 and B2 were done
> together as batch 2.

First question: are these technical replicates or biological?  If
technical, you may consider the 'doPoisson=TRUE' option of deDGE() since
that effectively sets r large (dispersion small), making it a Poisson
calculation.
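
To make the role of 'r' concrete, here is a rough sketch (in Python rather
than R, purely for illustration). It assumes the usual negative binomial
parameterisation, variance = mu + mu^2/r, so that dispersion = 1/r. The
function name nb_variance is made up for this example and is not part of
edgeR.

```python
def nb_variance(mu, r):
    """Variance of a negative binomial count with mean mu and size r.

    Assumes the parameterisation var = mu + mu**2 / r (dispersion = 1/r).
    """
    return mu + mu ** 2 / r


mu = 100.0  # hypothetical mean count for one tag
for r in (2.0, 20.0, 1e6):
    # As r grows, the extra-Poisson term mu**2 / r shrinks towards zero.
    print(f"r = {r:g}: variance = {nb_variance(mu, r):.2f} "
          f"(Poisson would be {mu:.2f})")
```

As r gets large the variance approaches the mean, which is the Poisson
case.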

> I ran several analyses and am completely puzzled.
>
> First I ran sage.test (Fisher's exact test) on A1, B1 and on A2,
> B2.  The results were strongly concordant in that there was a lot of
> overlap in the significant gene list,
> and the same genes were up/down regulated (on the whole).
>
> Then I ran edgeR on all 4 samples.  A large number of genes were
> declared significantly differentially expressed, but it was almost
> completely disjoint from the genes "found" by sage.test. (Fewer than
> 10 out of 4000).  The $r$ values were strongly clustered around 2,
> although some were huge.  Incidentally, the "exact" component of the
> output does not seem to be described in ?edgeR, but I understand it
> to be the p-value from the test.

'r' values around 2 suggest there is significant variation over and above
Poisson.  But maybe this is due to batch effects.

Indeed, the 'exact' element is the p-value from the exact test proposed in
the paper.

What do you use for 'lib.size' -- total number of reads?  Are they
drastically different from batch-to-batch/sample-to-sample?  How do the
batch effects manifest -- more total reads giving higher overall counts,
or something different?
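
One quick way to look at the library-size question is to put each tag
count on a per-million scale and compare across batches. A rough sketch in
Python follows; the library sizes and counts are invented placeholders, not
your data, and per_million is just a helper name chosen for this example.

```python
# Hypothetical library sizes (total reads per sample) and counts for one tag.
lib_size = {"A1": 2_000_000, "B1": 2_100_000, "A2": 4_000_000, "B2": 4_300_000}
tag_count = {"A1": 40, "B1": 84, "A2": 80, "B2": 172}


def per_million(count, total):
    """Scale a raw count to counts per million reads."""
    return count * 1e6 / total


cpm = {s: per_million(tag_count[s], lib_size[s]) for s in lib_size}
for sample, value in cpm.items():
    print(f"{sample}: {value:.1f} per million")
```

In this made-up case the raw counts double from batch 1 to batch 2 but the
per-million values agree within genotype, which is the situation where
total reads explain the batch difference.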

> Then I tested for batch effects by using sage.test on A1, A2 and  on
> B1, B2 and finally on A1 U B1 and A2 U B2.  A fairly large number of
> genes showed strong batch effects.  These overlapped more with the
> genotype within batch sage.test results than with the edgeR results.

Strong batch effects that aren't explained by total counts would result in
higher dispersion estimates (lower values of 'r') in edgeR, thus giving
fewer DE genes.  So, maybe this explains some of the lower overlap here.
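
A toy illustration of that effect, in Python with invented counts. Note
that this uses a crude method-of-moments dispersion, (s^2 - mean) / mean^2,
which is NOT the estimator edgeR uses; it only shows the direction of the
effect, and it assumes equal library sizes and dispersion = 1/r.

```python
from statistics import mean, variance


def moment_dispersion(counts):
    """Crude method-of-moments dispersion: (s^2 - mean) / mean^2, floored at 0.

    Not edgeR's estimator; assumes equal library sizes across replicates.
    """
    m = mean(counts)
    s2 = variance(counts)  # sample variance (n - 1 denominator)
    return max(0.0, (s2 - m) / m ** 2)


no_batch = [95, 105]    # hypothetical replicate counts, little batch shift
with_batch = [50, 150]  # same mean, but a strong batch shift

for label, counts in (("no batch effect", no_batch),
                      ("batch effect", with_batch)):
    phi = moment_dispersion(counts)
    r = float("inf") if phi == 0 else 1 / phi
    print(f"{label}: dispersion = {phi:.2f}, r = {r:.2f}")
```

With these invented numbers the batch-shifted pair gives a dispersion of
0.49, i.e. r close to 2, while the pair without a shift is no more variable
than Poisson.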

> Just to make things more confusing, the grad student who ran the
> samples used the normal approximation to the Poisson to test genotype
> effects within batch.  These
> were highly concordant between batches as well, but did not match the
> sage.test results.  I thought the p-values would be similar at least
> for genes with large counts, but they were not.
>
> I am inclined to go with combining the sage.test results, but any
> advice would be very welcome

Not sure I've really contributed much, but there must be a reasonable
explanation.

Mark

>
> Thanks,
>
> Naomi S. Altman                                814-865-3791 (voice)
> Associate Professor
> Dept. of Statistics                              814-863-7114 (fax)
> Penn State University                         814-865-1348 (Statistics)
> University Park, PA 16802-2111