[BioC] edgeR: dispersion estimation

Gordon K Smyth smyth at wehi.EDU.AU
Wed May 14 01:47:18 CEST 2014


Dear Yanzhu,

Thanks for providing your data.  edgeR will run quickly and correctly on 
your data if you use the classic dispersion estimation functions instead 
of the GLM functions.  The classic dispersion estimators are correct for 
your design matrix, because you are including all interactions in your 
factorial model. You can run glmFit and glmLRT using the classic 
dispersion estimates if you wish, or you can do exactTests between the 
groups.  My code on your data is given below.

The GLM functions fail because you aren't doing enough filtering of genes 
with low counts.  Your data has 96 different groups.  Keeping genes with 
just one count means that 95 out of the 96 groups have zero counts for all 
cases and the other has just one count.  No proper estimate of the 
dispersion is possible for these genes, and it seems that the GLM Cox-Reid 
code doesn't fully cope with this.

Best wishes
Gordon

> x <- read.delim("newPHTSeqRE_05092014.txt",row.names="gene")
> x2 <- x[rowSums(x)>0,]
> logCPM <- cpm(x2,log=TRUE,prior.count=5)
> mds <- plotMDS(logCPM,labels=1:ncol(x2),gene.selection="common")
> targets <- readTargets("newPHTSeqCondRE_05092014.txt")
> Rep <- factor(targets$Rep)
> Line <- factor(targets$Line)
> Sex <- factor(targets$Sex)
> Group <- paste(Line,Sex,Rep,sep=".")
> y <- DGEList(counts=x2,group=Group)
> y <- calcNormFactors(y,method="TMM")
> y <- estimateCommonDisp(y)
> y$common.disp
[1] 0.1308894
> y <- estimateTagwiseDisp(y)
> summary(y$tagwise.disp)
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
0.002045 0.056520 0.163700 0.335500 0.389100 6.376000


On Fri, 9 May 2014, Gordon K Smyth wrote:

> Dear Yanzhu,
>
> Please send me your data offlist so that we can trouble-shoot.
>
> Best wishes
> Gordon
>
>> Date: Wed,  7 May 2014 11:29:40 -0700 (PDT)
>> From: "Yanzhu [guest]" <guest at bioconductor.org>
>> To: bioconductor at r-project.org, mlinyzh at gmail.com
>> Subject: [BioC] edgeR: dispersion estimation
>> 
>> Dear Gordon,
>> 
>> Sorry for replying late, I am sick and take some days off. The following 
>> are my answers to your questions:
>> 
>> I have checked they are all integers, they are read counts generated by 
>> HT-Seq (v0.5.3p9). When I type your command I get zero as below:
>> 
>> max(y$counts-floor(y$counts))
>> [1] 0
>> 
>> 
>> I have also posted the session information as below. Both the R and edgeR 
>> package are the most recent versions. However, the results are still the 
>> same as before.
>> 
>> Are there other possible reasons that would cause such issue?
>> 
>> 
>> Thank you very much for your help! I greatly appreciate it.
>> 
>> 
>> 
>> Yanzhu
>> 
>> 
>> ########################################
>> Dear Yanzhu,
>> 
>> I find it hard to believe that your data are all integers, because there
>> is something seriously wrong, and we have only ever seen the problem you
>> report when there are fractional counts.
>> 
>> How have you checked that they are all integers?  How were the counts
>> created?  What do you see if you type:
>>
>>   max(y$counts-floor(y$counts))
>> 
>> It also isn't clear that you have installed the current version of edgeR.
>> Have you installed R 3.1.0?  Can you please give the sessionInfo() output?
>> The current version of edgeR is 3.6.1, whereas you seem to be using
>> version 3.2.4, which is two Bioconductor releases out of date.
>> 
>> Gordon
>> 
>> 
>> 
>> -- output of sessionInfo():
>> 
>>> sessionInfo()
>> R version 3.1.0 (2014-04-10)
>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>> 
>> locale:
>> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United 
>> States.1252    LC_MONETARY=English_United States.1252
>> [4] LC_NUMERIC=C                           LC_TIME=English_United 
>> States.1252
>> 
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets  methods 
>> base
>> 
>> other attached packages:
>> [1] DESeq_1.16.0         lattice_0.20-29      locfit_1.5-9.1 
>> Biobase_2.24.0       BiocGenerics_0.10.0
>> [6] edgeR_3.6.1          limma_3.20.1         BiocInstaller_1.14.2
>> 
>> loaded via a namespace (and not attached):
>> [1] annotate_1.42.0      AnnotationDbi_1.26.0 DBI_0.2-7 
>> genefilter_1.46.0    geneplotter_1.42.0
>> [6] GenomeInfoDb_1.0.2   grid_3.1.0           IRanges_1.22.6 
>> RColorBrewer_1.0-5   RSQLite_0.11.4
>> [11] splines_3.1.0        stats4_3.1.0         survival_2.37-7 
>> tools_3.1.0          XML_3.98-1.1
>> [16] xtable_1.7-3
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list