[BioC] edgeR: dispersion estimation

Gordon K Smyth smyth at wehi.EDU.AU
Thu May 15 00:55:29 CEST 2014


Yes.

Gordon

On Wed, 14 May 2014, Yanzhu Lin wrote:

> Dear Gordon,
>
> Thank you so much for your help. I greatly appreciate it.
>
> Just to make sure, so I can use classic dispersion estimators and then use
> glmFit and glmLRT to test the terms in my model?
>
> Best,
>
>
> Yanzhu
>
> On Tue, May 13, 2014 at 7:47 PM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>
>> Dear Yanzhu,
>>
>> Thanks for providing your data.  edgeR will run quickly and correctly on
>> your data if you use the classic dispersion estimation functions instead of
>> the GLM functions.  The classic dispersion estimators are correct for your
>> design matrix, because you are including all interactions in your factorial
>> model. You can run glmFit and glmLRT using the classic dispersion estimates
>> if you wish, or you can do exactTests between the groups.  My code on your
>> data is given below.
>>
>> The GLM functions fail because you aren't doing enough filtering of genes
>> with low counts.  Your data has 96 different groups.  Keeping genes with
>> just one count means that 95 out of the 96 groups have zero counts for all
>> cases and the other has just one count.  No proper estimate of the
>> dispersion is possible for these genes, and it seems that the GLM Cox-Reid
>> code doesn't fully cope with this.
>>
>> Best wishes
>> Gordon
>>
>>  x <- read.delim("newPHTSeqRE_05092014.txt",row.names="gene")
>>> x2 <- x[rowSums(x)>0,]
>>> logCPM <- cpm(x2,log=TRUE,prior.count=5)
>>> mds <- plotMDS(logCPM,labels=1:ncol(x2),gene.selection="common")
>>> targets <- readTargets("newPHTSeqCondRE_05092014.txt")
>>> Rep <- factor(targets$Rep)
>>> Line <- factor(targets$Line)
>>> Sex <- factor(targets$Sex)
>>> Group <- paste(Line,Sex,Rep,sep=".")
>>> y <- DGEList(counts=x2,group=Group)
>>> y <- calcNormFactors(y,method="TMM")
>>> y <- estimateCommonDisp(y)
>>> y$common.disp
>>>
>> [1] 0.1308894
>>
>>> y <- estimateTagwiseDisp(y)
>>> summary(y$tagwise.disp)
>>>
>>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
>> 0.002045 0.056520 0.163700 0.335500 0.389100 6.376000
>>
>>
>>
>> On Fri, 9 May 2014, Gordon K Smyth wrote:
>>
>>  Dear Yanzhu,
>>>
>>> Please send me your data offlist so that we can trouble-shoot.
>>>
>>> Best wishes
>>> Gordon
>>>
>>> Date: Wed,  7 May 2014 11:29:40 -0700 (PDT)
>>>> From: "Yanzhu [guest]" <guest at bioconductor.org>
>>>> To: bioconductor at r-project.org, mlinyzh at gmail.com
>>>> Subject: [BioC] edgeR: dispersion estimation
>>>>
>>>> Dear Gordon,
>>>>
>>>> Sorry for replying late, I am sick and take some days off. The following
>>>> are my answers to your questions:
>>>>
>>>> I have checked they are all integers, they are read counts generated by
>>>> HT-Seq (v0.5.3p9). When I type your command I get zero as below:
>>>>
>>>> max(y$counts-floor(y$counts))
>>>> [1] 0
>>>>
>>>>
>>>> I have also posted the session information as below. Both the R and
>>>> edgeR package are the most recent versions. However, the results are still
>>>> the same as before.
>>>>
>>>> Are there other possible reasons that would cause such issue?
>>>>
>>>>
>>>> Thank you very much for your help! I greatly appreciate it.
>>>>
>>>>
>>>>
>>>> Yanzhu
>>>>
>>>>
>>>> ########################################
>>>> Dear Yanzhu,
>>>>
>>>> I find it hard to believe that your data are all integers, because there
>>>> is something seriously wrong, and we have only ever seen the problem you
>>>> report when there are fractional counts.
>>>>
>>>> How have you checked that they are all integers?  How were the counts
>>>> created?  What do you see if you type:
>>>>
>>>>   max(y$counts-floor(y$counts))
>>>>
>>>> It also isn't clear that you have installed the current version of edgeR.
>>>> Have you installed R 3.1.0?  Can you please give the sessionInfo()
>>>> output?
>>>> The current version of edgeR is 3.6.1, whereas you seem to be using
>>>> version 3.2.4, which is two Bioconductor releases out of date.
>>>>
>>>> Gordon
>>>>
>>>>
>>>>
>>>> -- output of sessionInfo():
>>>>
>>>> sessionInfo()
>>>>>
>>>> R version 3.1.0 (2014-04-10)
>>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>
>>>> locale:
>>>> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United
>>>> States.1252    LC_MONETARY=English_United States.1252
>>>> [4] LC_NUMERIC=C                           LC_TIME=English_United
>>>> States.1252
>>>>
>>>> attached base packages:
>>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>>> base
>>>>
>>>> other attached packages:
>>>> [1] DESeq_1.16.0         lattice_0.20-29      locfit_1.5-9.1
>>>> Biobase_2.24.0       BiocGenerics_0.10.0
>>>> [6] edgeR_3.6.1          limma_3.20.1         BiocInstaller_1.14.2
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] annotate_1.42.0      AnnotationDbi_1.26.0 DBI_0.2-7
>>>> genefilter_1.46.0    geneplotter_1.42.0
>>>> [6] GenomeInfoDb_1.0.2   grid_3.1.0           IRanges_1.22.6
>>>> RColorBrewer_1.0-5   RSQLite_0.11.4
>>>> [11] splines_3.1.0        stats4_3.1.0         survival_2.37-7
>>>> tools_3.1.0          XML_3.98-1.1
>>>> [16] xtable_1.7-3
>>>>
>>>
>>>
>> ______________________________________________________________________
>> The information in this email is confidential and intended solely for the
>> addressee.
>> You must not disclose, forward, print or use it without the permission of
>> the sender.
>> ______________________________________________________________________
>>
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list