[BioC] EdgeR - problems running estimateCRDisp
Gordon K Smyth
smyth at wehi.EDU.AU
Sun Nov 28 06:34:19 CET 2010
Dear Josquin,
We have committed a major revision of edgeR, version 2.0.0, to
Bioconductor that runs successfully on all your data without errors.
Previously we simply relied on the glm code in the stats package of R to
fit the negative binomial distribution to one gene at a time. Now we have
written new code designed to fit thousands of genes simultaneously. The
new code implements a line search to prevent divergence, hence it now
works on all your data. An added bonus is that the new code is a couple
of orders of magnitude faster than the old.
Best wishes
Gordon
On Fri, 5 Nov 2010, Gordon K Smyth wrote:
> Dear Josquin,
>
> Thanks for providing some data. We've confirmed the problem is that the glm
> functions provided with R fail to converge, when used to fit a negative
> binomial glm, to some of the rows in your data, especially rows with very
> large counts. The problem affects glmFit and glmLRT, as well as
> estimateCRDisp, because they all make calls to glm.fit.
>
> We're writing some new glm code, implementing convergence checking etc, that
> should solve the problem. This will be incorporated into the edgeR package
> during the next week.
>
> Regards
> Gordon
>
> On Sun, 24 Oct 2010, Gordon K Smyth wrote:
>
>> Dear Josquin,
>>
>> estimateCDDisp() has worked on all the datasets we have tested it out on so
>> far (although that's admitedly a very limited number), so we are not able
>> to reproduce your problems. The function is liable to convergence issues
>> however because it needs to fit a very large number of generalized linear
>> models, one or more of which could conceivably fail to converge. That might
>> be the problem with your data. I don't see any problems with the code
>> example that you give, you seem to be using the functions correctly, so the
>> issue must be with the count data itself. Would you be willing to share
>> your data with us offline so that we diagnose and perhaps solve the
>> problem?
>>
>> To see estimateCDDisp() working as intended, you can run the example:
>>
>> library(edgeR)
>> example(glmFit)
>> CR <- estimateCRDisp(d, design)
>> fit <- glmFit(d, design, dispersion=CR$CR.common.dispersion)
>> results <- glmLRT(d, fit, coef=2)
>> topTags(results)
>>
>> Best wishes
>> Gordon
>>
>>
>> ----- ORIGINAL MESSAGE -------------
>> [BioC] EdgeR - problems running estimateCRDisp
>> josquin.tibbits at dpi.vic.gov.au josquin.tibbits at dpi.vic.gov.au
>> Fri Oct 22 07:44:00 CEST 2010
>>
>> I am running the new EdgeR package 1.8 and have updated R to 1.12.0 and
>> updated all the packages I have installed.
>>
>> I have an RNAseq experiment with 24 samples and am wanting to run a glm
>> analysis using the Cox-Reid common dispersion (and tagwise) paramaters. I
>> have created a DGEList object using a targets file and this successfully
>> ran in the calcNormFactors() and in the plotMDS.dge() functions. I have
>> also created what appears to be a sensible and ok design matrix using
>> model.matrix(). These then are being input to estimateCRDisp() to
>> estimate the common dispersion. The code I have run is as follows (in
>> red) with output (blue) which is pretty much exactly the same as the
>> worked example in the manual on pp 51-- and does not seem abnormal at
>> all......
>>
>> targets <- read.delim("Lp_NUE_DGEbymSEQ_TARGETS_EdgeRIN.txt",
>> stringsAsFactors = FALSE, row.names = "label")
>> targets$Nitrogen <- factor(targets$Nitrogen)
>> targets$Endophyte <- factor(targets$Endophyte)
>> targets$Tissue <- factor(targets$Tissue)
>> targets
>>
>> ## CREATE THE DGEList Object
>>
>> DGEList.object <- readDGE(targets, skip = 1, comment.char = '#')
>> colnames(DGEList.object) <- row.names(targets)
>>
>>
>>
>> ## CALCULATE THE NORMALISATION (f) FACTORS
>>
>> DGEList.object <- calcNormFactors(DGEList.object)
>>
>> ## Exclude data where rowsums are < 30 YOU CAN CHANGE THIS (there being 24
>> samples in the experiment)
>>
>> DGEList.object <- DGEList.object[rowSums(DGEList.object$counts) > 30, ]
>> DGEList.object
>>
>> An object of class "DGEList"
>> $samples
>> files Endophyte Tissue Nitrogen
>> lib.size norm.factors
>> Lp_NUE_LFFULL Lp_NUE_LFFULL_ER_RAWREADS.txt Free Leaf Full
>> 3786614 0.9244799
>> Lp_NUE_LFNO3 Lp_NUE_LFNO3_ER_RAWREADS.txt Free Leaf Partial
>> 5311856 0.8120410
>> Lp_NUE_LFK Lp_NUE_LFK_ER_RAWREADS.txt Free Leaf Full
>> 3482772 0.7224960
>> Lp_NUE_LFPO4 Lp_NUE_LFPO4_ER_RAWREADS.txt Free Leaf Full
>> 6208397 0.7367309
>> Lp_NUE_LFCa Lp_NUE_LFCa_ER_RAWREADS.txt Free Leaf Full
>> 3597004 0.8147619
>> 19 more rows ...
>>
>> $counts
>> Lp_NUE_LFFULL Lp_NUE_LFNO3 Lp_NUE_LFK Lp_NUE_LFPO4
>> Lp_NUE_LFCa Lp_NUE_LFNH4 Lp_NUE_RFFULL Lp_NUE_RFNO3 Lp_NUE_RFK
>> Lp_NUE_RFPO4
>>> prg_cds_44 345 382 120 261 141
>> 215 888 717 698 1112
>>> prg_cds_141 290 451 126 422 235
>> 459 676 559 474 879
>>> prg_cds_2 233 277 131 333 239
>> 376 246 185 124 282
>>> prg_cds_16 433 608 184 614 346
>> 423 924 756 595 913
>>> prg_cds_84 287 353 90 241 248
>> 394 823 612 989 1061
>> Lp_NUE_RFCa Lp_NUE_RFNH4 Lp_NUE_LSFULL Lp_NUE_LSNO3
>> Lp_NUE_LSK Lp_NUE_LSPO4 Lp_NUE_LSCa Lp_NUE_LSNH4 Lp_NUE_RSFULL
>> Lp_NUE_RSNO3
>>> prg_cds_44 1107 439 954 257 205
>> 107 868 106 316 1147
>>> prg_cds_141 583 1052 773 325 230
>> 193 600 264 414 1046
>>> prg_cds_2 235 454 177 232 177
>> 203 138 239 334 228
>>> prg_cds_16 817 1040 2373 464 220
>> 276 895 291 714 2655
>>> prg_cds_84 890 1653 894 268 169
>> 123 714 209 311 1078
>> Lp_NUE_RSK Lp_NUE_RSPO4 Lp_NUE_RSCa Lp_NUE_RSNH4
>>> prg_cds_44 622 903 114 474
>>> prg_cds_141 505 711 205 1166
>>> prg_cds_2 93 136 157 340
>>> prg_cds_16 1398 1660 311 1258
>>> prg_cds_84 893 771 154 1269
>> 63905 more rows ...
>>
>>
>> ## Produce an MDS plot of the data
>>
>> #plotMDS.dge(DGEList.object, main = "MDS plot for Lp_NUE_DGEbymSEQ")
>>
>>
>> ## DEFINE THE DESIGN MATRIX
>>
>> design <- model.matrix(~ Endophyte + Tissue + Nitrogen , data = targets)
>> print("This is the design matrix")
>> design
>>
>> (Intercept) EndophyteSTWT TissueRoot NitrogenPartial
>> Lp_NUE_LFFULL 1 0 0 0
>> Lp_NUE_LFNO3 1 0 0 1
>> Lp_NUE_LFK 1 0 0 0
>> Lp_NUE_LFPO4 1 0 0 0
>> Lp_NUE_LFCa 1 0 0 0
>> Lp_NUE_LFNH4 1 0 0 1
>> Lp_NUE_RFFULL 1 0 1 0
>> Lp_NUE_RFNO3 1 0 1 1
>> Lp_NUE_RFK 1 0 1 0
>> Lp_NUE_RFPO4 1 0 1 0
>> Lp_NUE_RFCa 1 0 1 0
>> Lp_NUE_RFNH4 1 0 1 1
>> Lp_NUE_LSFULL 1 1 0 0
>> Lp_NUE_LSNO3 1 1 0 1
>> Lp_NUE_LSK 1 1 0 0
>> Lp_NUE_LSPO4 1 1 0 0
>> Lp_NUE_LSCa 1 1 0 0
>> Lp_NUE_LSNH4 1 1 0 1
>> Lp_NUE_RSFULL 1 1 1 0
>> Lp_NUE_RSNO3 1 1 1 1
>> Lp_NUE_RSK 1 1 1 0
>> Lp_NUE_RSPO4 1 1 1 0
>> Lp_NUE_RSCa 1 1 1 0
>> Lp_NUE_RSNH4 1 1 1 1
>> attr(,"assign")
>> [1] 0 1 2 3
>> attr(,"contrasts")
>> attr(,"contrasts")$Endophyte
>> [1] "contr.treatment"
>>
>> attr(,"contrasts")$Tissue
>> [1] "contr.treatment"
>>
>> attr(,"contrasts")$Nitrogen
>> [1] "contr.treatment"
>>
>>
>> ## ESTIMATE THE Cox-Reid common dispersion
>>
>> DGEList.object <- estimateCRDisp(DGEList.object, design)
>> names(DGEList.object)
>>
>>
>> This step is giving the following error:::
>>
>>
>> Loading required package: MASS
>> Error: NA/NaN/Inf in foreign function call (arg 1)
>> In addition: Warning messages:
>> 1: glm.fit: algorithm did not converge
>> 2: step size truncated due to divergence
>>> names(DGEList.object)
>> [1] "samples" "counts"
>>
>> Clearly the function is not working. I have tried two different computers
>> and a run it in 64 bit but always with the same result. I have also tried
>> using a balanced design and a simple design all with the same results. I
>> have used the DGEList object to analyse the data using the original
>> estimateCommonDisp() and exactTest() which worked fine.
>>
>>
>> I was hoping someone might be able to help resolve this for me????
>>
>>
>> Thanks in advance....................... Josquin
>>
>> The sessionInfo() and traceback() are below:
>>
>>
>>> sessionInfo()
>> R version 2.12.0 (2010-10-15)
>> Platform: i386-pc-mingw32/i386 (32-bit)
>>
>> locale:
>> [1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252
>> LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
>> [5] LC_TIME=English_Australia.1252
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] MASS_7.3-8 edgeR_1.8.1
>>
>> loaded via a namespace (and not attached):
>> [1] limma_3.6.0
>>
>>
>>> traceback()
>> 4: .Fortran("dqrls", qr = x[good, ] * w, n = ngoodobs, p = nvars,
>> y = w * z, ny = 1L, tol = min(1e-07, control$epsilon/1000),
>> coefficients = double(nvars), residuals = double(ngoodobs),
>> effects = double(ngoodobs), rank = integer(1L), pivot = 1L:nvars,
>> qraux = double(nvars), work = double(2 * nvars), PACKAGE = "base")
>> 3: glm.fit(design, y[i, ], offset = offset[i, ], family = f)
>> 2: adjustedProfileLik(spline.disp[i], y.select, design = design,
>> offset = offset.mat.select + lib.size.mat.select)
>> 1: estimateCRDisp(DGEList.object, design)
>> Notice:
>> This email and any attachments may contain information t...{{dropped:14}}
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list