[BioC] EdgeR - problems running estimateCRDisp

Davis McCarthy dmccarthy at wehi.EDU.AU
Thu Dec 9 07:52:48 CET 2010


Dear Josquin

Just to follow up here: yesterday I committed an update to the release and devel versions of edgeR to fix a big related to the use of offsets in the GLMs. I strongly recommend that you and other edgeR users update to the latest version of the package, especially if you are using the GLM functions.

Best regards
Davis


On Nov 28, 2010, at 4:34 PM, Gordon K Smyth wrote:

> Dear Josquin,
> 
> We have committed a major revision of edgeR, version 2.0.0, to Bioconductor that runs successfully on all your data without errors.
> 
> Previously we simply relied on the glm code in the stats package of R to fit the negative binomial distribution to one gene at a time.  Now we have written new code designed to fit thousands of genes simultaneously.  The new code implements a line search to prevent divergence, hence it now works on all your data.  An added bonus is that the new code is a couple of orders of magnitude faster than the old.
> 
> Best wishes
> Gordon
> 
> On Fri, 5 Nov 2010, Gordon K Smyth wrote:
> 
>> Dear Josquin,
>> 
>> Thanks for providing some data.  We've confirmed the problem is that the glm functions provided with R fail to converge, when used to fit a negative binomial glm, to some of the rows in your data, especially rows with very large counts.  The problem affects glmFit and glmLRT, as well as estimateCRDisp, because they all make calls to glm.fit.
>> 
>> We're writing some new glm code, implementing convergence checking etc, that should solve the problem.  This will be incorporated into the edgeR package during the next week.
>> 
>> Regards
>> Gordon
>> 
>> On Sun, 24 Oct 2010, Gordon K Smyth wrote:
>> 
>>> Dear Josquin,
>>> estimateCDDisp() has worked on all the datasets we have tested it out on so far (although that's admitedly a very limited number), so we are not able to reproduce your problems.  The function is liable to convergence issues however because it needs to fit a very large number of generalized linear models, one or more of which could conceivably fail to converge. That might be the problem with your data.  I don't see any problems with the code example that you give, you seem to be using the functions correctly, so the issue must be with the count data itself.  Would you be willing to share your data with us offline so that we diagnose and perhaps solve the problem?
>>> To see estimateCDDisp() working as intended, you can run the example:
>>> 
>>> library(edgeR)
>>> example(glmFit)
>>> CR <- estimateCRDisp(d, design)
>>> fit <- glmFit(d, design, dispersion=CR$CR.common.dispersion)
>>> results <- glmLRT(d, fit, coef=2)
>>> topTags(results)
>>> Best wishes
>>> Gordon
>>> ----- ORIGINAL MESSAGE -------------
>>> [BioC] EdgeR - problems running estimateCRDisp
>>> josquin.tibbits at dpi.vic.gov.au josquin.tibbits at dpi.vic.gov.au
>>> Fri Oct 22 07:44:00 CEST 2010
>>> I am running the new EdgeR package 1.8 and have updated R to 1.12.0 and
>>> updated all the packages I have installed.
>>> I have an RNAseq experiment with 24 samples and am wanting to run a glm
>>> analysis using the Cox-Reid common dispersion (and tagwise) paramaters.  I
>>> have created a DGEList object using a targets file and this successfully
>>> ran in the calcNormFactors() and in the plotMDS.dge() functions. I have
>>> also created what appears to be a sensible and ok design matrix using
>>> model.matrix().  These then are being input to estimateCRDisp() to
>>> estimate the common dispersion.  The code I have run is as follows (in
>>> red) with output (blue) which is pretty much exactly the same as the
>>> worked example in the manual on pp 51-- and does not seem abnormal at
>>> all......
>>> targets <- read.delim("Lp_NUE_DGEbymSEQ_TARGETS_EdgeRIN.txt",
>>> stringsAsFactors = FALSE, row.names = "label")
>>> targets$Nitrogen        <- factor(targets$Nitrogen)
>>> targets$Endophyte       <- factor(targets$Endophyte)
>>> targets$Tissue <- factor(targets$Tissue)
>>> targets
>>> ## CREATE THE DGEList Object
>>> DGEList.object <- readDGE(targets, skip = 1, comment.char = '#')
>>> colnames(DGEList.object) <- row.names(targets)
>>> ## CALCULATE THE NORMALISATION (f) FACTORS
>>> DGEList.object <- calcNormFactors(DGEList.object)
>>> ## Exclude data where rowsums are < 30 YOU CAN CHANGE THIS (there being 24
>>> samples in the experiment)
>>> DGEList.object <- DGEList.object[rowSums(DGEList.object$counts) > 30, ]
>>> DGEList.object
>>> An object of class "DGEList"
>>> $samples
>>>                                      files Endophyte Tissue Nitrogen
>>> lib.size norm.factors
>>> Lp_NUE_LFFULL  Lp_NUE_LFFULL_ER_RAWREADS.txt      Free   Leaf     Full
>>> 3786614    0.9244799
>>> Lp_NUE_LFNO3    Lp_NUE_LFNO3_ER_RAWREADS.txt      Free   Leaf  Partial
>>> 5311856    0.8120410
>>> Lp_NUE_LFK        Lp_NUE_LFK_ER_RAWREADS.txt      Free   Leaf     Full
>>> 3482772    0.7224960
>>> Lp_NUE_LFPO4    Lp_NUE_LFPO4_ER_RAWREADS.txt      Free   Leaf     Full
>>> 6208397    0.7367309
>>> Lp_NUE_LFCa      Lp_NUE_LFCa_ER_RAWREADS.txt      Free   Leaf     Full
>>> 3597004    0.8147619
>>> 19 more rows ...
>>> $counts
>>>            Lp_NUE_LFFULL  Lp_NUE_LFNO3  Lp_NUE_LFK  Lp_NUE_LFPO4
>>> Lp_NUE_LFCa Lp_NUE_LFNH4  Lp_NUE_RFFULL   Lp_NUE_RFNO3  Lp_NUE_RFK
>>> Lp_NUE_RFPO4
>>>> prg_cds_44             345           382         120           261  141
>>>    215             888           717         698         1112
>>>> prg_cds_141            290           451         126           422  235
>>>    459             676           559         474          879
>>>> prg_cds_2              233           277         131           333  239
>>>    376             246           185         124          282
>>>> prg_cds_16             433           608         184           614  346
>>>    423             924           756         595          913
>>>> prg_cds_84             287           353          90           241  248
>>>    394             823           612         989         1061
>>>            Lp_NUE_RFCa  Lp_NUE_RFNH4  Lp_NUE_LSFULL  Lp_NUE_LSNO3
>>> Lp_NUE_LSK Lp_NUE_LSPO4 Lp_NUE_LSCa  Lp_NUE_LSNH4  Lp_NUE_RSFULL
>>> Lp_NUE_RSNO3
>>>> prg_cds_44          1107           439            954           257  205
>>>      107          868           106             316         1147
>>>> prg_cds_141          583          1052            773           325  230
>>>      193          600           264             414         1046
>>>> prg_cds_2            235           454            177           232  177
>>>      203          138           239             334          228
>>>> prg_cds_16           817          1040           2373           464  220
>>>      276          895           291             714         2655
>>>> prg_cds_84           890          1653            894           268  169
>>>      123          714           209             311         1078
>>>            Lp_NUE_RSK  Lp_NUE_RSPO4  Lp_NUE_RSCa  Lp_NUE_RSNH4
>>>> prg_cds_44          622           903          114          474
>>>> prg_cds_141         505           711          205         1166
>>>> prg_cds_2            93           136          157          340
>>>> prg_cds_16         1398          1660          311         1258
>>>> prg_cds_84          893           771          154         1269
>>> 63905 more rows ...
>>> ## Produce an MDS plot of the data
>>> #plotMDS.dge(DGEList.object, main = "MDS plot for Lp_NUE_DGEbymSEQ")
>>> ## DEFINE THE DESIGN MATRIX
>>> design <- model.matrix(~ Endophyte + Tissue + Nitrogen , data = targets)
>>> print("This is the design matrix")
>>> design
>>> 
>>>               (Intercept) EndophyteSTWT TissueRoot NitrogenPartial
>>> Lp_NUE_LFFULL             1             0          0               0
>>> Lp_NUE_LFNO3              1             0          0               1
>>> Lp_NUE_LFK                1             0          0               0
>>> Lp_NUE_LFPO4              1             0          0               0
>>> Lp_NUE_LFCa               1             0          0               0
>>> Lp_NUE_LFNH4              1             0          0               1
>>> Lp_NUE_RFFULL             1             0          1               0
>>> Lp_NUE_RFNO3              1             0          1               1
>>> Lp_NUE_RFK                1             0          1               0
>>> Lp_NUE_RFPO4              1             0          1               0
>>> Lp_NUE_RFCa               1             0          1               0
>>> Lp_NUE_RFNH4              1             0          1               1
>>> Lp_NUE_LSFULL             1             1          0               0
>>> Lp_NUE_LSNO3              1             1          0               1
>>> Lp_NUE_LSK                1             1          0               0
>>> Lp_NUE_LSPO4              1             1          0               0
>>> Lp_NUE_LSCa               1             1          0               0
>>> Lp_NUE_LSNH4              1             1          0               1
>>> Lp_NUE_RSFULL             1             1          1               0
>>> Lp_NUE_RSNO3              1             1          1               1
>>> Lp_NUE_RSK                1             1          1               0
>>> Lp_NUE_RSPO4              1             1          1               0
>>> Lp_NUE_RSCa               1             1          1               0
>>> Lp_NUE_RSNH4              1             1          1               1
>>> attr(,"assign")
>>> [1] 0 1 2 3
>>> attr(,"contrasts")
>>> attr(,"contrasts")$Endophyte
>>> [1] "contr.treatment"
>>> attr(,"contrasts")$Tissue
>>> [1] "contr.treatment"
>>> attr(,"contrasts")$Nitrogen
>>> [1] "contr.treatment"
>>> ## ESTIMATE THE Cox-Reid common dispersion
>>> DGEList.object <- estimateCRDisp(DGEList.object, design)
>>> names(DGEList.object)
>>> This step is giving  the following error:::
>>> Loading required package: MASS
>>> Error: NA/NaN/Inf in foreign function call (arg 1)
>>> In addition: Warning messages:
>>> 1: glm.fit: algorithm did not converge
>>> 2: step size truncated due to divergence
>>>> names(DGEList.object)
>>> [1] "samples" "counts"
>>> Clearly the function is not working. I have tried two different computers
>>> and a run it in 64 bit but always with the same result. I have also tried
>>> using a balanced design and a simple design all with the same results. I
>>> have used the DGEList object to analyse the data using the original
>>> estimateCommonDisp() and exactTest() which worked fine.
>>> I was hoping someone might be able to help resolve this for me????
>>> Thanks in advance....................... Josquin
>>> The sessionInfo() and traceback() are below:
>>>> sessionInfo()
>>> R version 2.12.0 (2010-10-15)
>>> Platform: i386-pc-mingw32/i386 (32-bit)
>>> locale:
>>> [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252
>>> LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
>>> [5] LC_TIME=English_Australia.1252
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>> other attached packages:
>>> [1] MASS_7.3-8  edgeR_1.8.1
>>> loaded via a namespace (and not attached):
>>> [1] limma_3.6.0
>>>> traceback()
>>> 4: .Fortran("dqrls", qr = x[good, ] * w, n = ngoodobs, p = nvars,
>>>      y = w * z, ny = 1L, tol = min(1e-07, control$epsilon/1000),
>>>      coefficients = double(nvars), residuals = double(ngoodobs),
>>>      effects = double(ngoodobs), rank = integer(1L), pivot = 1L:nvars,
>>>      qraux = double(nvars), work = double(2 * nvars), PACKAGE = "base")
>>> 3: glm.fit(design, y[i, ], offset = offset[i, ], family = f)
>>> 2: adjustedProfileLik(spline.disp[i], y.select, design = design,
>>>      offset = offset.mat.select + lib.size.mat.select)
>>> 1: estimateCRDisp(DGEList.object, design)
>>> Notice:
>>> This email and any attachments may contain information t...{{dropped:14}}

------------------------------------------------------------------------
Davis J McCarthy
Research Technician
Bioinformatics Division
Walter and Eliza Hall Institute of Medical Research
1G Royal Parade, Parkville, Vic 3052, Australia
dmccarthy at wehi.edu.au
http://www.wehi.edu.au



______________________________________________________________________
The information in this email is confidential and intend...{{dropped:6}}



More information about the Bioconductor mailing list