[BioC] prior df estimation of estimateDisp function

Koen Van den Berge koen.vdberge at gmail.com
Tue Feb 4 23:25:22 CET 2014


Dear all,

Thank you, Gordon, for this useful reply. I am currently still looking into effects of different dispersion estimators and different settings for these estimators. 
Again, I am working on the spiked-in data, also mentioned in the Voom paper, as mentioned below.
Since the estDisp() function suggested a prior df value of infinity - giving total weight to the prior distribution (which would be the trended dispersion estimator, I guess?) I had hoped to reach the same results in comparing the libraries as I had when analysing the data using the trended dispersion estimator.
However, I get different numbers of DE genes when analysing with the estDisp() function as when I do so when analysing with the trended dispersion estimator. I have added a small table in appendix with the results in doing so. While I was thinking on how this could be possible, an idea would be that the GoF plots were largely different and this might possibly have been the reason for the observed differences in number of DE genes. However, it does not look like this is the case. I have also added the plots in appendix.

If anybody would be able in enlightening me further through this problem, it would be a great help and another step forward in fully understanding the edgeR package.

Thank you in advance,
Koen Van den Berge
-------------- next part --------------
A non-text attachment was scrubbed...
Name: results_priors_spiked.pdf
Type: application/pdf
Size: 14944 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20140204/13f946a0/attachment.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gof_inf.pdf
Type: application/pdf
Size: 12797 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20140204/13f946a0/attachment-0001.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gof_trend.pdf
Type: application/pdf
Size: 12754 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20140204/13f946a0/attachment-0002.pdf>
-------------- next part --------------




On 03 Feb 2014, at 23:35, Gordon K Smyth <smyth at wehi.EDU.AU> wrote:

> Dear Koen,
> 
> The results you are seeing are as one would expect for a technical datasets.  Remember that the dispersion measures biological variation. The replicates here are just re-sequencing (and perhaps some re-preparation) of exactly the same RNA samples, so there is no biological variation apart from slight inaccuracies in the preparation of the samples.  Hence the tagwise dispersions will be very small (or even zero) and nearly equal. Hence the prior df will be estimated to be large or even infinity.
> 
> Best wishes
> Gordon
> 
>> Date: Sun, 2 Feb 2014 12:16:05 +0100
>> From: Koen Van den Berge <koen.vdberge at gmail.com>
>> To: Bioconductor mailing list <bioconductor at r-project.org>
>> Subject: [BioC] prior df estimation of estimateDisp function
>> 
>> Dear Bioconductors,
>> 
>> In evaluating several RNA-seq analysis techniques, I decided to analyse the same SEQC spiked-in dataset as mentioned in the limma-Voom paper (Law et al. 2014, Genome Biology).
> 
>> This dataset contains 92 genes, 69 of which are spiked-in at different concentrations. As in the paper, I copied the 23 non DE genes twice, providing me a dataset with a total of 138 genes, half of which are (non) DE.
> 
>> The authors use a filter on the dataset that requires a cpm higher than 1 in at least 4 libraries.
> 
>> First I decided to analyse the data using EdgeR:
>> 
>> spiked <- read.csv(filename)
>> 
>> #Copy non DE genes
>> copies <- spiked[id,]
>> spiked2 <- rbind(spiked,copies)
>> spiked3 <- rbind(spiked2,copies)
>> 
>> #Design matrix
>> libraries <- factor(rep(c("A" , "B" , "C" , "D"),each=4))
>> design <- model.matrix(~ libraries)
>> 
>> #make DGElist and normalize
>> y <- DGEList(spiked3[,-1], group= libraries)
>> y$samples$norm.factors <- normFactorsAll
>> 
>> #estimate dispersion
>> estimateDisip(y , design)
>> 
>> In doing so, the estimateDisp() function provided me an estimate of prior df of 58.90.
>> 
>> In assessing the effect of the filtering process, I followed the same steps as above, however not applying the filtering process. The estimateDisp() function then provided me an estimate of prior df of Infinity, suggesting the trended dispersion estimation should be used. Why is this happening? Is the ML dispersion estimation sensitive to these low counts, resulting in an estimate of infinity for the prior degrees of freedom?
> 
>> I can not really grasp this result, so any suggestions or help in this topic are very welcome.
>> 
>> Sincerely,
>> Koen.
> 
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:9}}


More information about the Bioconductor mailing list