[BioC] bug in estimateDisp() or mistake in documentation?
Gordon K Smyth
smyth at wehi.EDU.AU
Wed Jul 16 03:23:13 CEST 2014
Hi Jenny,
The estimateDisp() function is correct. It does in fact use the qCML
dispersion estimates when there is no design matrix, and it does this even
though a design matrix is computed internally. However it doesn't give
exactly the same estimates as estimateCommonDisp and estimateTagwiseDisp
because it is a complete rewrite of the approach with somewhat different
internal grid parameter settings. I agree that this last point is
potentially confusing and is not adequately explained in the
documentation. Our intention actually is to consolidate the functions so
that estimateDisp() is always the engine even when estimateTagwiseDisp()
is called.
Best wishes
Gordon
---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.statsci.org/smyth
On Tue, 15 Jul 2014, Zadeh, Jenny Drnevich wrote:
> Hello,
>
> I've found either a bug in estimateDisp() or a mistake in the
> documentation about when it uses the classic qCML vs. the GLM CR
> dispersion estimates. The help page for estimateDisp() says
> "If there is no design matrix, it calculates the quantile conditional
> likelihood for each gene (tag) and then maximize it. The method is same
> as in the function estimateCommonDisp and estimateTagwiseDisp".
> However, when you call estimateDisp() without specifying a design matrix
> but with a DGEList object with more than one level in samples$group, it
> internally calculates a design matrix and gives you
> common/trended/tagwise dispersion estimates. Actually, even when
> samples$group has only one level, estimateDisp still internally creates
> a unit vector design matrix and outputs common/trended/tagwise
> dispersion estimates. So is it a bug in the code or a mistake in the
> documentation and estimateDisp() can't be used to get the same results
> as estimateCommonDisp() and estimateTagwiseDisp()? (example below)
>
> It was my understanding that for a single factor experiment (2 or more
> groups), the classic qCML was preferred over the GLM dispersion.
> However, if you have more than 2 groups and you want to calculate a
> oneway ANOVA or specialized contrasts, it seems you have to use the GLM
> approach. Is the GLM approach therefore becoming the best method for all
> experimental designs, even those with only a single factor?
>
> Thanks,
> Jenny
>
> #example - data as indicated in edgeRUsersGuide() section 4.3
>> library(edgeR)
> Loading required package: limma
>> shell.exec(system.file("doc","edgeRUsersGuide.pdf",package="edgeR"))
>>
>> targets <- readTargets()
>> targets
> Lane Treatment Label
> Con1 1 Control Con1
> Con2 2 Control Con2
> Con3 2 Control Con3
> Con4 4 Control Con4
> DHT1 5 DHT DHT1
> DHT2 6 DHT DHT2
> DHT3 8 DHT DHT3
>> x <- read.delim("pnas_expression.txt", row.names=1, stringsAsFactors=FALSE)
>> y <- DGEList(counts=x[,1:7], group=targets$Treatment, genes=data.frame(Length=x[,8]))
>> colnames(y) <- targets$Label
>> keep <- rowSums(cpm(y)>1) >= 3
>> y <- y[keep,]
>> y$samples$lib.size <- colSums(y$counts)
>> y <- calcNormFactors(y)
>> y$samples
> group lib.size norm.factors
> Con1 Control 976847 1.0296636
> Con2 Control 1154746 1.0372521
> Con3 Control 1439393 1.0362662
> Con4 Control 1482652 1.0378383
> DHT1 DHT 1820628 0.9537095
> DHT2 DHT 1831553 0.9525624
> DHT3 DHT 680798 0.9583181
>>
>> y2 <- y
>>
>> y <- estimateCommonDisp(y, verbose=TRUE)
> Disp = 0.02002 , BCV = 0.1415
>> y <- estimateTagwiseDisp(y)
>> y2 <- estimateDisp(y2)
> Loading required package: splines
>>
>> names(y)
> [1] "counts" "samples" "genes"
> [4] "common.dispersion" "pseudo.counts" "pseudo.lib.size"
> [7] "AveLogCPM" "prior.n" "tagwise.dispersion"
>> names(y2)
> [1] "counts" "samples" "genes"
> [4] "common.dispersion" "trended.dispersion" "trend.method"
> [7] "AveLogCPM" "span" "tagwise.dispersion"
> [10] "prior.df" "prior.n"
>>
>> all.equal(y$tagwise.dispersion, y2$tagwise.dispersion)
> [1] "Mean relative difference: 0.1156227"
>>
>> et <- exactTest(y)
>> summary(decideTestsDGE(et))
> [,1]
> -1 2096
> 0 12059
> 1 2339
>>
>> et2 <- exactTest(y2)
>> summary(decideTestsDGE(et2))
> [,1]
> -1 2078
> 0 12084
> 1 2332
>>
>>
>> sessionInfo()
> R version 3.1.1 (2014-07-10)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252
> [2] LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] splines stats graphics grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] locfit_1.5-9.1 edgeR_3.6.6 limma_3.20.8
> [4] BiocInstaller_1.14.2
>
> loaded via a namespace (and not attached):
> [1] grid_3.1.1 lattice_0.20-29 tools_3.1.1
>
>
> Jenny Drnevich, Ph.D.
>
> Functional Genomics Bioinformatics Specialist
> High Performance Biological Computing Program
> and The Roy J. Carver Biotechnology Center
> University of Illinois, Urbana-Champaign
> NOTE NEW OFFICE LOCATION
> 2112 IGB
> 1206 W. Gregory Dr.
> Urbana, IL 61801
> USA
> ph: 217-300-6543
> fax: 217-265-5066
> e-mail: drnevich at illinois.edu<mailto:drnevich at illinois.edu>
>
>
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list