[BioC] bug in estimateDisp() or mistake in documentation?

Wed Jul 16 03:23:13 CEST 2014

Hi Jenny,

The estimateDisp() function is correct.  It does in fact use the qCML 
dispersion estimates when there is no design matrix, and it does this even 
though a design matrix is computed internally.  However it doesn't give 
exactly the same estimates as estimateCommonDisp and estimateTagwiseDisp 
because it is a complete rewrite of the approach with somewhat different 
internal grid parameter settings.  I agree that this last point is 
potentially confusing and is not adequately explained in the 
documentation.  Our intention actually is to consolidate the functions so 
that estimateDisp() is always the engine even when estimateTagwiseDisp() 
is called.

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.statsci.org/smyth

On Tue, 15 Jul 2014, Zadeh, Jenny Drnevich wrote:

> Hello,
>
> I've found either a bug in estimateDisp() or a mistake in the 
> documentation about when it uses the classic qCML vs. the GLM CR 
> dispersion estimates. The help page for estimateDisp() says

> "If there is no design matrix, it calculates the quantile conditional 
> likelihood for each gene (tag) and then maximize it. The method is same 
> as in the function estimateCommonDisp and estimateTagwiseDisp".

> However, when you call estimateDisp() without specifying a design matrix 
> but with a DGEList object with more than one level in samples$group, it 
> internally calculates a design matrix and gives you 
> common/trended/tagwise dispersion estimates. Actually, even when 
> samples$group has only one level, estimateDisp still internally creates 
> a unit vector design matrix and outputs common/trended/tagwise 
> dispersion estimates. So is it a bug in the code or a mistake in the 
> documentation and estimateDisp() can't be used to get the same results 
> as estimateCommonDisp() and estimateTagwiseDisp()? (example below)
>
> It was my understanding that for a single factor experiment (2 or more 
> groups), the classic qCML was preferred over the GLM dispersion. 
> However, if you have more than 2 groups and you want to calculate a 
> oneway ANOVA or specialized contrasts, it seems you have to use the GLM 
> approach. Is the GLM approach therefore becoming the best method for all 
> experimental designs, even those with only a single factor?
>
> Thanks,
> Jenny
>
> #example - data as indicated in edgeRUsersGuide() section 4.3
>> library(edgeR)
> Loading required package: limma
>> shell.exec(system.file("doc","edgeRUsersGuide.pdf",package="edgeR"))
>>
>> targets <- readTargets()
>> targets
>     Lane Treatment Label
> Con1    1   Control  Con1
> Con2    2   Control  Con2
> Con3    2   Control  Con3
> Con4    4   Control  Con4
> DHT1    5       DHT  DHT1
> DHT2    6       DHT  DHT2
> DHT3    8       DHT  DHT3
>> x <- read.delim("pnas_expression.txt", row.names=1, stringsAsFactors=FALSE)
>> y <- DGEList(counts=x[,1:7], group=targets$Treatment, genes=data.frame(Length=x[,8]))
>> colnames(y) <- targets$Label
>> keep <- rowSums(cpm(y)>1) >= 3
>> y <- y[keep,]
>> y$samples$lib.size <- colSums(y$counts)
>> y <- calcNormFactors(y)
>> y$samples
>       group lib.size norm.factors
> Con1 Control   976847    1.0296636
> Con2 Control  1154746    1.0372521
> Con3 Control  1439393    1.0362662
> Con4 Control  1482652    1.0378383
> DHT1     DHT  1820628    0.9537095
> DHT2     DHT  1831553    0.9525624
> DHT3     DHT   680798    0.9583181
>>
>> y2 <- y
>>
>> y <- estimateCommonDisp(y, verbose=TRUE)
> Disp = 0.02002 , BCV = 0.1415
>> y <- estimateTagwiseDisp(y)
>> y2 <- estimateDisp(y2)
> Loading required package: splines
>>
>> names(y)
> [1] "counts"             "samples"            "genes"
> [4] "common.dispersion"  "pseudo.counts"      "pseudo.lib.size"
> [7] "AveLogCPM"          "prior.n"            "tagwise.dispersion"
>> names(y2)
> [1] "counts"             "samples"            "genes"
> [4] "common.dispersion"  "trended.dispersion" "trend.method"
> [7] "AveLogCPM"          "span"               "tagwise.dispersion"
> [10] "prior.df"           "prior.n"
>>
>> all.equal(y$tagwise.dispersion, y2$tagwise.dispersion)
> [1] "Mean relative difference: 0.1156227"
>>
>> et <- exactTest(y)
>> summary(decideTestsDGE(et))
>   [,1]
> -1  2096
> 0  12059
> 1   2339
>>
>> et2 <- exactTest(y2)
>> summary(decideTestsDGE(et2))
>   [,1]
> -1  2078
> 0  12084
> 1   2332
>>
>>
>> sessionInfo()
> R version 3.1.1 (2014-07-10)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252
> [2] LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] splines   stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] locfit_1.5-9.1       edgeR_3.6.6          limma_3.20.8
> [4] BiocInstaller_1.14.2
>
> loaded via a namespace (and not attached):
> [1] grid_3.1.1      lattice_0.20-29 tools_3.1.1
>
>
> Jenny Drnevich, Ph.D.
>
> Functional Genomics Bioinformatics Specialist
> High Performance Biological Computing Program
> and The Roy J. Carver Biotechnology Center
> University of Illinois, Urbana-Champaign
> NOTE NEW OFFICE LOCATION
> 2112 IGB
> 1206 W. Gregory Dr.
> Urbana, IL 61801
> USA
> ph: 217-300-6543
> fax: 217-265-5066
> e-mail: drnevich at illinois.edu<mailto:drnevich at illinois.edu>
>
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}