[BioC] EdgeR - problems running estimateCRDisp
Davis McCarthy
dmccarthy at wehi.EDU.AU
Thu Dec 9 07:52:48 CET 2010
Dear Josquin
Just to follow up here: yesterday I committed an update to the release and devel versions of edgeR to fix a big related to the use of offsets in the GLMs. I strongly recommend that you and other edgeR users update to the latest version of the package, especially if you are using the GLM functions.
Best regards
Davis
On Nov 28, 2010, at 4:34 PM, Gordon K Smyth wrote:
> Dear Josquin,
>
> We have committed a major revision of edgeR, version 2.0.0, to Bioconductor that runs successfully on all your data without errors.
>
> Previously we simply relied on the glm code in the stats package of R to fit the negative binomial distribution to one gene at a time. Now we have written new code designed to fit thousands of genes simultaneously. The new code implements a line search to prevent divergence, hence it now works on all your data. An added bonus is that the new code is a couple of orders of magnitude faster than the old.
>
> Best wishes
> Gordon
>
> On Fri, 5 Nov 2010, Gordon K Smyth wrote:
>
>> Dear Josquin,
>>
>> Thanks for providing some data. We've confirmed the problem is that the glm functions provided with R fail to converge, when used to fit a negative binomial glm, to some of the rows in your data, especially rows with very large counts. The problem affects glmFit and glmLRT, as well as estimateCRDisp, because they all make calls to glm.fit.
>>
>> We're writing some new glm code, implementing convergence checking etc, that should solve the problem. This will be incorporated into the edgeR package during the next week.
>>
>> Regards
>> Gordon
>>
>> On Sun, 24 Oct 2010, Gordon K Smyth wrote:
>>
>>> Dear Josquin,
>>> estimateCDDisp() has worked on all the datasets we have tested it out on so far (although that's admitedly a very limited number), so we are not able to reproduce your problems. The function is liable to convergence issues however because it needs to fit a very large number of generalized linear models, one or more of which could conceivably fail to converge. That might be the problem with your data. I don't see any problems with the code example that you give, you seem to be using the functions correctly, so the issue must be with the count data itself. Would you be willing to share your data with us offline so that we diagnose and perhaps solve the problem?
>>> To see estimateCDDisp() working as intended, you can run the example:
>>>
>>> library(edgeR)
>>> example(glmFit)
>>> CR <- estimateCRDisp(d, design)
>>> fit <- glmFit(d, design, dispersion=CR$CR.common.dispersion)
>>> results <- glmLRT(d, fit, coef=2)
>>> topTags(results)
>>> Best wishes
>>> Gordon
>>> ----- ORIGINAL MESSAGE -------------
>>> [BioC] EdgeR - problems running estimateCRDisp
>>> josquin.tibbits at dpi.vic.gov.au josquin.tibbits at dpi.vic.gov.au
>>> Fri Oct 22 07:44:00 CEST 2010
>>> I am running the new EdgeR package 1.8 and have updated R to 1.12.0 and
>>> updated all the packages I have installed.
>>> I have an RNAseq experiment with 24 samples and am wanting to run a glm
>>> analysis using the Cox-Reid common dispersion (and tagwise) paramaters. I
>>> have created a DGEList object using a targets file and this successfully
>>> ran in the calcNormFactors() and in the plotMDS.dge() functions. I have
>>> also created what appears to be a sensible and ok design matrix using
>>> model.matrix(). These then are being input to estimateCRDisp() to
>>> estimate the common dispersion. The code I have run is as follows (in
>>> red) with output (blue) which is pretty much exactly the same as the
>>> worked example in the manual on pp 51-- and does not seem abnormal at
>>> all......
>>> targets <- read.delim("Lp_NUE_DGEbymSEQ_TARGETS_EdgeRIN.txt",
>>> stringsAsFactors = FALSE, row.names = "label")
>>> targets$Nitrogen <- factor(targets$Nitrogen)
>>> targets$Endophyte <- factor(targets$Endophyte)
>>> targets$Tissue <- factor(targets$Tissue)
>>> targets
>>> ## CREATE THE DGEList Object
>>> DGEList.object <- readDGE(targets, skip = 1, comment.char = '#')
>>> colnames(DGEList.object) <- row.names(targets)
>>> ## CALCULATE THE NORMALISATION (f) FACTORS
>>> DGEList.object <- calcNormFactors(DGEList.object)
>>> ## Exclude data where rowsums are < 30 YOU CAN CHANGE THIS (there being 24
>>> samples in the experiment)
>>> DGEList.object <- DGEList.object[rowSums(DGEList.object$counts) > 30, ]
>>> DGEList.object
>>> An object of class "DGEList"
>>> $samples
>>> files Endophyte Tissue Nitrogen
>>> lib.size norm.factors
>>> Lp_NUE_LFFULL Lp_NUE_LFFULL_ER_RAWREADS.txt Free Leaf Full
>>> 3786614 0.9244799
>>> Lp_NUE_LFNO3 Lp_NUE_LFNO3_ER_RAWREADS.txt Free Leaf Partial
>>> 5311856 0.8120410
>>> Lp_NUE_LFK Lp_NUE_LFK_ER_RAWREADS.txt Free Leaf Full
>>> 3482772 0.7224960
>>> Lp_NUE_LFPO4 Lp_NUE_LFPO4_ER_RAWREADS.txt Free Leaf Full
>>> 6208397 0.7367309
>>> Lp_NUE_LFCa Lp_NUE_LFCa_ER_RAWREADS.txt Free Leaf Full
>>> 3597004 0.8147619
>>> 19 more rows ...
>>> $counts
>>> Lp_NUE_LFFULL Lp_NUE_LFNO3 Lp_NUE_LFK Lp_NUE_LFPO4
>>> Lp_NUE_LFCa Lp_NUE_LFNH4 Lp_NUE_RFFULL Lp_NUE_RFNO3 Lp_NUE_RFK
>>> Lp_NUE_RFPO4
>>>> prg_cds_44 345 382 120 261 141
>>> 215 888 717 698 1112
>>>> prg_cds_141 290 451 126 422 235
>>> 459 676 559 474 879
>>>> prg_cds_2 233 277 131 333 239
>>> 376 246 185 124 282
>>>> prg_cds_16 433 608 184 614 346
>>> 423 924 756 595 913
>>>> prg_cds_84 287 353 90 241 248
>>> 394 823 612 989 1061
>>> Lp_NUE_RFCa Lp_NUE_RFNH4 Lp_NUE_LSFULL Lp_NUE_LSNO3
>>> Lp_NUE_LSK Lp_NUE_LSPO4 Lp_NUE_LSCa Lp_NUE_LSNH4 Lp_NUE_RSFULL
>>> Lp_NUE_RSNO3
>>>> prg_cds_44 1107 439 954 257 205
>>> 107 868 106 316 1147
>>>> prg_cds_141 583 1052 773 325 230
>>> 193 600 264 414 1046
>>>> prg_cds_2 235 454 177 232 177
>>> 203 138 239 334 228
>>>> prg_cds_16 817 1040 2373 464 220
>>> 276 895 291 714 2655
>>>> prg_cds_84 890 1653 894 268 169
>>> 123 714 209 311 1078
>>> Lp_NUE_RSK Lp_NUE_RSPO4 Lp_NUE_RSCa Lp_NUE_RSNH4
>>>> prg_cds_44 622 903 114 474
>>>> prg_cds_141 505 711 205 1166
>>>> prg_cds_2 93 136 157 340
>>>> prg_cds_16 1398 1660 311 1258
>>>> prg_cds_84 893 771 154 1269
>>> 63905 more rows ...
>>> ## Produce an MDS plot of the data
>>> #plotMDS.dge(DGEList.object, main = "MDS plot for Lp_NUE_DGEbymSEQ")
>>> ## DEFINE THE DESIGN MATRIX
>>> design <- model.matrix(~ Endophyte + Tissue + Nitrogen , data = targets)
>>> print("This is the design matrix")
>>> design
>>>
>>> (Intercept) EndophyteSTWT TissueRoot NitrogenPartial
>>> Lp_NUE_LFFULL 1 0 0 0
>>> Lp_NUE_LFNO3 1 0 0 1
>>> Lp_NUE_LFK 1 0 0 0
>>> Lp_NUE_LFPO4 1 0 0 0
>>> Lp_NUE_LFCa 1 0 0 0
>>> Lp_NUE_LFNH4 1 0 0 1
>>> Lp_NUE_RFFULL 1 0 1 0
>>> Lp_NUE_RFNO3 1 0 1 1
>>> Lp_NUE_RFK 1 0 1 0
>>> Lp_NUE_RFPO4 1 0 1 0
>>> Lp_NUE_RFCa 1 0 1 0
>>> Lp_NUE_RFNH4 1 0 1 1
>>> Lp_NUE_LSFULL 1 1 0 0
>>> Lp_NUE_LSNO3 1 1 0 1
>>> Lp_NUE_LSK 1 1 0 0
>>> Lp_NUE_LSPO4 1 1 0 0
>>> Lp_NUE_LSCa 1 1 0 0
>>> Lp_NUE_LSNH4 1 1 0 1
>>> Lp_NUE_RSFULL 1 1 1 0
>>> Lp_NUE_RSNO3 1 1 1 1
>>> Lp_NUE_RSK 1 1 1 0
>>> Lp_NUE_RSPO4 1 1 1 0
>>> Lp_NUE_RSCa 1 1 1 0
>>> Lp_NUE_RSNH4 1 1 1 1
>>> attr(,"assign")
>>> [1] 0 1 2 3
>>> attr(,"contrasts")
>>> attr(,"contrasts")$Endophyte
>>> [1] "contr.treatment"
>>> attr(,"contrasts")$Tissue
>>> [1] "contr.treatment"
>>> attr(,"contrasts")$Nitrogen
>>> [1] "contr.treatment"
>>> ## ESTIMATE THE Cox-Reid common dispersion
>>> DGEList.object <- estimateCRDisp(DGEList.object, design)
>>> names(DGEList.object)
>>> This step is giving the following error:::
>>> Loading required package: MASS
>>> Error: NA/NaN/Inf in foreign function call (arg 1)
>>> In addition: Warning messages:
>>> 1: glm.fit: algorithm did not converge
>>> 2: step size truncated due to divergence
>>>> names(DGEList.object)
>>> [1] "samples" "counts"
>>> Clearly the function is not working. I have tried two different computers
>>> and a run it in 64 bit but always with the same result. I have also tried
>>> using a balanced design and a simple design all with the same results. I
>>> have used the DGEList object to analyse the data using the original
>>> estimateCommonDisp() and exactTest() which worked fine.
>>> I was hoping someone might be able to help resolve this for me????
>>> Thanks in advance....................... Josquin
>>> The sessionInfo() and traceback() are below:
>>>> sessionInfo()
>>> R version 2.12.0 (2010-10-15)
>>> Platform: i386-pc-mingw32/i386 (32-bit)
>>> locale:
>>> [1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252
>>> LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
>>> [5] LC_TIME=English_Australia.1252
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>> other attached packages:
>>> [1] MASS_7.3-8 edgeR_1.8.1
>>> loaded via a namespace (and not attached):
>>> [1] limma_3.6.0
>>>> traceback()
>>> 4: .Fortran("dqrls", qr = x[good, ] * w, n = ngoodobs, p = nvars,
>>> y = w * z, ny = 1L, tol = min(1e-07, control$epsilon/1000),
>>> coefficients = double(nvars), residuals = double(ngoodobs),
>>> effects = double(ngoodobs), rank = integer(1L), pivot = 1L:nvars,
>>> qraux = double(nvars), work = double(2 * nvars), PACKAGE = "base")
>>> 3: glm.fit(design, y[i, ], offset = offset[i, ], family = f)
>>> 2: adjustedProfileLik(spline.disp[i], y.select, design = design,
>>> offset = offset.mat.select + lib.size.mat.select)
>>> 1: estimateCRDisp(DGEList.object, design)
>>> Notice:
>>> This email and any attachments may contain information t...{{dropped:14}}
------------------------------------------------------------------------
Davis J McCarthy
Research Technician
Bioinformatics Division
Walter and Eliza Hall Institute of Medical Research
1G Royal Parade, Parkville, Vic 3052, Australia
dmccarthy at wehi.edu.au
http://www.wehi.edu.au
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:6}}
More information about the Bioconductor
mailing list