[BioC] edgeR vs DESeq for comparison without replicate
Woo, Sangsoon
swoo at fhcrc.org
Sat Jul 9 16:00:56 CEST 2011
Dear Davis,
Thank you so much for your explanation and details.
I know that it's always a difficult problem when we do not have replicates.
I think I had better rank genes by their logFC instead of setting a p-value threshold.
At least that way we can see biological differences. Of course, we need to be careful with genes that have only a few reads in one group.
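
To make the caution about low counts concrete, here is a minimal Python sketch of ranking by logFC with a small prior count added to each group, so that genes with only a few reads in one group do not dominate the ranking. This is an illustration only, not edgeR or DESeq code: the function names, the default prior_count of 2.0, and the example counts are all assumptions made for this sketch.

```python
import math

def shrunken_log2_fc(count_a, count_b, lib_a, lib_b, prior_count=2.0):
    """log2 fold change (B over A) after adding a small prior count.

    The prior is scaled by library size so that two genes with zero
    reads in both samples get a logFC of exactly 0.
    """
    mean_lib = (lib_a + lib_b) / 2.0
    prior_a = prior_count * lib_a / mean_lib
    prior_b = prior_count * lib_b / mean_lib
    rate_a = (count_a + prior_a) / lib_a
    rate_b = (count_b + prior_b) / lib_b
    return math.log2(rate_b / rate_a)

def rank_by_abs_log_fc(counts, lib_a, lib_b, prior_count=2.0):
    """Return gene names sorted by decreasing |logFC|.

    `counts` maps a gene name to a (count_a, count_b) pair.
    """
    scores = {
        gene: shrunken_log2_fc(a, b, lib_a, lib_b, prior_count)
        for gene, (a, b) in counts.items()
    }
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)

# Hypothetical counts: geneX has few reads, geneY is well covered.
example = {"geneX": (0, 3), "geneY": (100, 800), "geneZ": (50, 55)}
print(rank_by_abs_log_fc(example, lib_a=1_000_000, lib_b=1_000_000))
```

Without the prior count, geneX (0 vs 3 reads) would have an infinite fold change and top the list; with it, the well-supported geneY ranks first. The size of the prior is a judgment call and changes the ranking of low-count genes.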
I will take a look at Gordon's discussion.
Thanks again.
Sangsoon
----- Original Message -----
From: "Davis McCarthy" <dmccarthy at wehi.EDU.AU>
To: "Sangsoon Woo" <swoo at fhcrc.org>
Cc: bioconductor at r-project.org
Sent: Saturday, July 9, 2011 1:11:02 AM
Subject: Re: [BioC] edgeR vs DESeq for comparison without replicate
Dear Sangsoon
Analysing count data for significance without replicates is always
somewhat problematic. Experience tells us that genomic count data
(ChIP-Seq, RNA-Seq, etc.) has substantial variability, more than a Poisson
distribution is able to account for. However, if you do not have
replicates then it is not possible to account for the extra-Poisson
variability (overdispersion) in a completely satisfying way.
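
As a rough illustration of what overdispersion means here, the following Python sketch compares the Poisson variance with the negative binomial (NB) variance under the usual parameterisation var = mu + dispersion * mu^2. It is a sketch only: the function names and the example means and dispersion value are assumptions chosen for illustration, not output from edgeR or DESeq.

```python
import math

def poisson_variance(mu):
    """Variance of a Poisson count with mean mu."""
    return mu

def nb_variance(mu, dispersion):
    """Variance of an NB count with mean mu: mu + dispersion * mu^2."""
    return mu + dispersion * mu ** 2

def nb_cv(mu, dispersion):
    """Coefficient of variation (sd / mean) of an NB count."""
    return math.sqrt(nb_variance(mu, dispersion)) / mu

# Illustrative means and an assumed dispersion of 0.1.
for mu in (10, 100, 1000):
    print(mu, poisson_variance(mu), nb_variance(mu, 0.1), round(nb_cv(mu, 0.1), 3))
```

For large means the NB coefficient of variation approaches sqrt(dispersion), whereas the Poisson coefficient of variation keeps shrinking towards zero. That is why a Poisson model treats even modest relative differences between two highly covered genes as significant.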
I don't think that there really is an answer to the question of which of
edgeR or DESeq is "better" for analysing data without replicates. Given
that both packages assess significance using Robinson & Smyth's exact test
(Biostatistics, 2008), both will give essentially the same significance
results if the dispersion modeling is the same.
Now, in this case, you are using very different dispersion modeling
approaches in edgeR and DESeq, so the results are not all that comparable.
There have been discussions previously on this mailing list that suggest
roughly estimating the dispersion, in both edgeR and DESeq, by fitting the
NB model under the assumption that there is no difference between samples.
The results that you describe are not surprising. The edgeR analysis that
you did is a Poisson model analysis, which we would expect to yield many
significant DE genes. The DESeq analysis that you have described (and
which I would probably also normally recommend as a better approach to use
in edgeR) roughly estimates the dispersion---once you allow for some
variability in the data you see no DE. Again this is not unexpected
behaviour.
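
The contrast described above can be sketched for a single gene in Python. The example conditions on the total count of the two samples and sums the probabilities of all splits that are no more likely than the observed one, once with zero dispersion (Poisson) and once with a non-zero NB dispersion. It assumes equal library sizes and a known dispersion, and it is not the edgeR or DESeq implementation; the function names, the counts 50 and 100, and the dispersion 0.4 are assumptions made for illustration.

```python
import math

def _log_nb_weight(k, size):
    """log of Gamma(k + size) / (k! * Gamma(size)), the k-dependent NB term."""
    return math.lgamma(k + size) - math.lgamma(k + 1) - math.lgamma(size)

def conditional_pmf(total, dispersion):
    """P(x1 = k | x1 + x2 = total) under the null, for k = 0..total.

    Assumes both samples have the same library size and the same mean.
    dispersion == 0 gives the Poisson case, i.e. Binomial(total, 0.5).
    """
    if dispersion > 0:
        size = 1.0 / dispersion
        logw = [_log_nb_weight(k, size) + _log_nb_weight(total - k, size)
                for k in range(total + 1)]
    else:
        logw = [-math.lgamma(k + 1) - math.lgamma(total - k + 1)
                for k in range(total + 1)]
    top = max(logw)
    weights = [math.exp(v - top) for v in logw]  # subtract max for stability
    norm = sum(weights)
    return [w / norm for w in weights]

def exact_test_pvalue(x1, x2, dispersion):
    """Two-sided p-value: total probability of splits no likelier than observed."""
    pmf = conditional_pmf(x1 + x2, dispersion)
    p_obs = pmf[x1]
    # Small relative tolerance so ties are not lost to rounding.
    return min(1.0, sum(p for p in pmf if p <= p_obs * (1 + 1e-7)))

# Hypothetical gene with 50 reads in one sample and 100 in the other.
print("Poisson:", exact_test_pvalue(50, 100, dispersion=0.0))
print("NB, dispersion 0.4:", exact_test_pvalue(50, 100, dispersion=0.4))
```

Under the Poisson assumption a two-fold difference at this depth gets a very small p-value, while with a dispersion of 0.4 the same counts are unremarkable. Without replicates the dispersion itself cannot be estimated reliably, so the choice of value largely determines the result.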
There is currently another thread on Bioconductor in which Gordon has
discussed more strategies for analysis when there are no replicates. I
recommend that you have a look at his thoughts there.
What you haven't told us is the size of the dispersion estimates that
DESeq is using. In my experience (common) dispersion values for biological
replicate data are often in the range of 0.1-0.6. If the dispersion values
that you are using are much higher than this then I would be looking at
things much more closely.
Fundamentally, however, assessing statistical significance without
replicate samples is very difficult - it's a lot to ask of a software
package to pull out sensible DE genes without replication. I am somewhat
relieved that the DESeq approach you took and the tagwise dispersions in
edgeR both yield no DE genes.
In the end, robust statistical inference on differential expression
requires (biologically) replicate samples, and there's no easy way around
that.
Best wishes
Davis
> Dear all,
>
> I am working on a ChIP-Seq data set.
> I want to compare two groups with only one sample per group (no
> replicates in either group).
> I generated a count matrix in which each element is the number of reads
> within a gene region for each data set.
>
> I applied edgeR and DESeq methods for this comparison.
>
> For this case,
> 1. edgeR uses Poisson by setting common.disp=1e-6 (effectively zero).
> 2. DESeq still uses NB by assuming there is no difference between the two
> samples to estimate dispersion.
>
> The results are
> 1. edgeR identifies many genes with very small p-values / adjusted p-values
> when I used the common.disp approach.
> 2. edgeR gives no significant genes with the tagwise.disp option.
> 3. DESeq does not identify any significant gene.
>
> I think that the p-values of #2 and #3 are based on summing the
> probabilities of all pairs of counts that, under the null hypothesis, are
> less probable than the observed pair of counts. But #1 is based on a
> Poisson distribution, which has much smaller variation than the actual data.
> Am I right?
> Looking at the raw counts for top genes is not helpful because it is just
> comparing two numbers.
>
> Which package is better for the case without replicates, based on your
> experience?
>
> Thanks for your help in advance.
> Sangsoon
>
--------------------------------------------------
Davis J McCarthy
Research Technician
Bioinformatics Division
Walter and Eliza Hall Institute of Medical Research
1G Royal Parade, Parkville, Vic 3052, Australia.
dmccarthy at wehi.edu.au
http://www.wehi.edu.au