[BioC] prior df estimation of estimateDisp function

Gordon K Smyth smyth at wehi.EDU.AU
Wed Feb 5 23:49:52 CET 2014


> Date: Tue, 4 Feb 2014 23:25:22 +0100
> From: Koen Van den Berge <koen.vdberge at gmail.com>
> To: Gordon K Smyth <smyth at wehi.EDU.AU>
> Cc: Bioconductor mailing list <bioconductor at r-project.org>
> Subject: Re: [BioC] prior df estimation of estimateDisp function
>
> Dear all,
>
> Thank you, Gordon, for this useful reply. I am currently still looking 
> into effects of different dispersion estimators and different settings 
> for these estimators.

> Again, I am working on the spiked-in data, also mentioned in the Voom
> paper, as mentioned below.

This is an unfortunate choice of dataset.  It is hard to learn much about
dispersion estimation from a dataset than doesn't have any dispersion, or
almost none.

> Since the estDisp() function suggested a prior df value of infinity - 
> giving total weight to the prior distribution (which would be the 
> trended dispersion estimator, I guess?)

I guess you mean estimateDisp().  Yes.

> I had hoped to reach the same results in comparing the libraries as I 
> had when analysing the data using the trended dispersion estimator.

estimateGLMTrendedDisp() gives a different estimate of the trended 
dispersion to estimateDisp().

You can see this for example by plotBCV().

Gordon

> However, I get different numbers of DE genes when analysing with the 
> estDisp() function as when I do so when analysing with the trended 
> dispersion estimator. I have added a small table in appendix with the 
> results in doing so. While I was thinking on how this could be possible, 
> an idea would be that the GoF plots were largely different and this 
> might possibly have been the reason for the observed differences in 
> number of DE genes. However, it does not look like this is the case. I 
> have also added the plots in appendix.
>
> If anybody would be able in enlightening me further through this 
> problem, it would be a great help and another step forward in fully 
> understanding the edgeR package.
>
> Thank you in advance,
> Koen Van den Berge
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: results_priors_spiked.pdf
> Type: application/pdf
> Size: 14944 bytes
> Desc: not available
> URL:
> 
<https://stat.ethz.ch/pipermail/bioconductor/attachments/20140204/13f946a0
> /attachment-0003.pdf>
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: gof_inf.pdf
> Type: application/pdf
> Size: 12797 bytes
> Desc: not available
> URL:
> 
<https://stat.ethz.ch/pipermail/bioconductor/attachments/20140204/13f946a0
> /attachment-0004.pdf>
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: gof_trend.pdf
> Type: application/pdf
> Size: 12754 bytes
> Desc: not available
> URL:
> 
<https://stat.ethz.ch/pipermail/bioconductor/attachments/20140204/13f946a0
> /attachment-0005.pdf>
> -------------- next part --------------
>
>

> On 03 Feb 2014, at 23:35, Gordon K Smyth <smyth at wehi.EDU.AU> wrote:
>
>> Dear Koen,
>>
>> The results you are seeing are as one would expect for a technical
>> datasets.  Remember that the dispersion measures biological variation.
>> The replicates here are just re-sequencing (and perhaps some
>> re-preparation) of exactly the same RNA samples, so there is no
>> biological variation apart from slight inaccuracies in the preparation
>> of the samples.  Hence the tagwise dispersions will be very small (or
>> even zero) and nearly equal. Hence the prior df will be estimated to be
>> large or even infinity.
>>
>> Best wishes
>> Gordon
>>
>>> Date: Sun, 2 Feb 2014 12:16:05 +0100
>>> From: Koen Van den Berge <koen.vdberge at gmail.com>
>>> To: Bioconductor mailing list <bioconductor at r-project.org>
>>> Subject: [BioC] prior df estimation of estimateDisp function
>>>
>>> Dear Bioconductors,
>>>
>>> In evaluating several RNA-seq analysis techniques, I decided to
>>> analyse the same SEQC spiked-in dataset as mentioned in the limma-Voom
>>> paper (Law et al. 2014, Genome Biology).
>>
>>> This dataset contains 92 genes, 69 of which are spiked-in at different
>>> concentrations. As in the paper, I copied the 23 non DE genes twice,
>>> providing me a dataset with a total of 138 genes, half of which are
>>> (non) DE.
>>
>>> The authors use a filter on the dataset that requires a cpm higher
>>> than 1 in at least 4 libraries.
>>
>>> First I decided to analyse the data using EdgeR:
>>>
>>> spiked <- read.csv(filename)
>>>
>>> #Copy non DE genes
>>> copies <- spiked[id,]
>>> spiked2 <- rbind(spiked,copies)
>>> spiked3 <- rbind(spiked2,copies)
>>>
>>> #Design matrix
>>> libraries <- factor(rep(c("A" , "B" , "C" , "D"),each=4))
>>> design <- model.matrix(~ libraries)
>>>
>>> #make DGElist and normalize
>>> y <- DGEList(spiked3[,-1], group= libraries)
>>> y$samples$norm.factors <- normFactorsAll
>>>
>>> #estimate dispersion
>>> estimateDisip(y , design)
>>>
>>> In doing so, the estimateDisp() function provided me an estimate of
>>> prior df of 58.90.
>>>
>>> In assessing the effect of the filtering process, I followed the same
>>> steps as above, however not applying the filtering process. The
>>> estimateDisp() function then provided me an estimate of prior df of
>>> Infinity, suggesting the trended dispersion estimation should be used.
>>> Why is this happening? Is the ML dispersion estimation sensitive to
>>> these low counts, resulting in an estimate of infinity for the prior
>>> degrees of freedom?
>>
>>> I can not really grasp this result, so any suggestions or help in this
>>> topic are very welcome.
>>>
>>> Sincerely,
>>> Koen.

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list