[BioC] edgeR glm fit error

Gordon K Smyth smyth at wehi.EDU.AU
Tue Mar 1 00:48:51 CET 2011


Hi Colin,

Unfortunately I can't characterize exactly which transcripts the glm code 
will fail for.  Basically it is liable to fail for highly variable data 
for which the log-linear model fits very poorly, like the two examples you 
give.

I've just run an overall test (between all the time points) on your data, 
and I find most of the transcripts in the your data set to be 
differentially expressed, at high level of significance.  There's so much 
differential expression that I suggest you go with the pairwise exact 
tests.

Best wishes
Gordon

On Sun, 27 Feb 2011, Colin Maxwell wrote:

> Hello Gordon,

> Thanks for taking the time to respond. I wanted to fit the design matrix 
> like that to get an overall statistic for whether the gene was 
> differentially expressed across the time series. If there's a way to get 
> that out of the pairwise comparisions, I'd be glad to try it out, 
> otherwise being able to fit the glm would be nice.
>
> The transcripts that the algorithm is choking on seem to be very strange ones:
>
>        BC2 BC5 BC13 BC3   BC8 BC1   BC6  BC4 BC7 BC16
> 19422 32795   0    0   0 21684   0 10843 8111   0    0
> 19362 3835    0    0    0    0    0 6930    0    0    0
>
> To my eye those look like bad data points, and I'd want to exclude them. 
> But do you know why exactly the algorithm would fail to converge when 
> trying to fit to them? I'm thinking if there were a rule I could apply I 
> could probably just drop them and save you the trouble of writing new 
> code.
>
>
> -Colin
>
>
> On Feb 27, 2011, at 9:26 PM, Gordon K Smyth wrote:
>
>> Hi Colin,
>>
>> Thanks for the nice reproducible data example.
>>
>> The problem that you see is caused by lack of convergence of the 
>> interative glm algorithm for some of your transcripts.  Here are two 
>> possible solutions.
>>
>> Firstly, your design has the form of a one-way layout, so you could 
>> analyse your data using classic exact edgeR instead of the glm code. 
>> In this style of analysis, you would make pairwise comparisons between 
>> the time points.  I have tried this one your data, and it works fine:
>>
>> > times <- factor(times,levels=c("zero","one","three","six"))
>> > d$samples$group <- times
>> > d2 <- estimateCommonDisp(d)
>> > d2$common.disp
>>  [1] 0.2359880
>> > d2 <- estimateTagwiseDisp(d2)
>>  Using grid search to estimate tagwise dispersion.
>> > topTags(exactTest(d2,c("zero","one"),common=FALSE))
>>  Comparison of groups: one-zero
>>          logConc      logFC       PValue          FDR
>>  19847 -29.88350  40.265106 7.849699e-50 1.458945e-45
>>  6370  -31.78645  36.459199 8.076247e-37 7.505257e-33
>>  20014 -31.57373  36.884641 1.569212e-36 7.655628e-33
>>  13189 -16.39203 -17.244752 1.647612e-36 7.655628e-33
>>  3387  -31.89198  36.248152 6.213699e-36 2.309756e-32
>>  6146  -19.46744  11.313569 4.111695e-31 1.273666e-27
>>  17831 -30.38242 -39.267266 9.207825e-31 2.444809e-27
>>  791   -12.50480   7.443195 8.829879e-25 2.051402e-21
>>  11698 -16.87556   7.510317 1.779005e-24 3.673842e-21
>>  17562 -31.98794 -36.056237 2.586138e-23 4.806596e-20
>>
>> If you really do need the glm functionality, because you want to do 
>> something other than pairwise comparisons, then we will have to make 
>> the glm code work for you.  We will probably never succeed in writing 
>> glm fitting code that converges for every possible data set, yet for 
>> your data we can make the functions work if (i) we recognise that the 
>> design is a oneway layout or (ii) use better starting values.  If you 
>> need this, let us know and we will send you some newer code.
>>
>> Best wishes
>> Gordon
>>
>>
>>> Date: Sat, 26 Feb 2011 15:08:28 -0500
>>> From: Colin Maxwell <csm29 at duke.edu>
>>> To: bioconductor at r-project.org
>>> Subject: [BioC] edgeR glm fit error
>>>
>>> Hello,
>>> I'm trying to use edgeR to analyze some RNA-seq time series data. I have
>>> four time points. The first and last time points have three replicates,
>>> while the middle two have two replicates. The following gives the error I
>>> get:
>>>
>>> require(edgeR)
>>> counts <- read.csv("http://www.duke.edu/~csm29/counts.csv",row.names=1)<
>>> d <- DGEList(counts)
>>> d <- calcNormFactors(d)
>>> d <- d[rowSums(d$counts)>9,]
>>> times <- rep(c("zero","one","three","six"),c(3,2,2,3))
>>> design <- model.matrix(~factor(times))
>>> d <- estimateCRDisp(d, design)
>>>
>>> Error in beta[k, ] <- betaj[decr, ] :
>>> NAs are not allowed in subscripted assignments
>>>
>>> traceback()
>>> 3: mglmLS(y, design, dispersion, offset = offset)
>>> 2: adjustedProfileLik(spline.disp[i], y.filt, design = design, offset =
>>> offset.mat.filt +
>>>     lib.size.mat.filt)
>>> 1: estimateCRDisp(d, design)
>>>
>>> sessionInfo()
>>> R version 2.12.0 (2010-10-15)
>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] edgeR_2.0.3
>>>
>>> loaded via a namespace (and not attached):
>>> [1] limma_3.6.9
>>>
>>> The function seems to be getting hung up on one or two genes when I recover
>>> the error. However, when I remove those genes from the data, the problem is
>>> still there. Any help would be much appreciated!

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list