[BioC] edgeR glm fit error
Gordon K Smyth
smyth at wehi.EDU.AU
Tue Mar 1 00:48:51 CET 2011
Hi Colin,
Unfortunately I can't characterize exactly which transcripts the glm code
will fail for. Basically it is liable to fail for highly variable data
for which the log-linear model fits very poorly, like the two examples you
give.
I've just run an overall test (between all the time points) on your data,
and I find most of the transcripts in the your data set to be
differentially expressed, at high level of significance. There's so much
differential expression that I suggest you go with the pairwise exact
tests.
Best wishes
Gordon
On Sun, 27 Feb 2011, Colin Maxwell wrote:
> Hello Gordon,
> Thanks for taking the time to respond. I wanted to fit the design matrix
> like that to get an overall statistic for whether the gene was
> differentially expressed across the time series. If there's a way to get
> that out of the pairwise comparisions, I'd be glad to try it out,
> otherwise being able to fit the glm would be nice.
>
> The transcripts that the algorithm is choking on seem to be very strange ones:
>
> BC2 BC5 BC13 BC3 BC8 BC1 BC6 BC4 BC7 BC16
> 19422 32795 0 0 0 21684 0 10843 8111 0 0
> 19362 3835 0 0 0 0 0 6930 0 0 0
>
> To my eye those look like bad data points, and I'd want to exclude them.
> But do you know why exactly the algorithm would fail to converge when
> trying to fit to them? I'm thinking if there were a rule I could apply I
> could probably just drop them and save you the trouble of writing new
> code.
>
>
> -Colin
>
>
> On Feb 27, 2011, at 9:26 PM, Gordon K Smyth wrote:
>
>> Hi Colin,
>>
>> Thanks for the nice reproducible data example.
>>
>> The problem that you see is caused by lack of convergence of the
>> interative glm algorithm for some of your transcripts. Here are two
>> possible solutions.
>>
>> Firstly, your design has the form of a one-way layout, so you could
>> analyse your data using classic exact edgeR instead of the glm code.
>> In this style of analysis, you would make pairwise comparisons between
>> the time points. I have tried this one your data, and it works fine:
>>
>> > times <- factor(times,levels=c("zero","one","three","six"))
>> > d$samples$group <- times
>> > d2 <- estimateCommonDisp(d)
>> > d2$common.disp
>> [1] 0.2359880
>> > d2 <- estimateTagwiseDisp(d2)
>> Using grid search to estimate tagwise dispersion.
>> > topTags(exactTest(d2,c("zero","one"),common=FALSE))
>> Comparison of groups: one-zero
>> logConc logFC PValue FDR
>> 19847 -29.88350 40.265106 7.849699e-50 1.458945e-45
>> 6370 -31.78645 36.459199 8.076247e-37 7.505257e-33
>> 20014 -31.57373 36.884641 1.569212e-36 7.655628e-33
>> 13189 -16.39203 -17.244752 1.647612e-36 7.655628e-33
>> 3387 -31.89198 36.248152 6.213699e-36 2.309756e-32
>> 6146 -19.46744 11.313569 4.111695e-31 1.273666e-27
>> 17831 -30.38242 -39.267266 9.207825e-31 2.444809e-27
>> 791 -12.50480 7.443195 8.829879e-25 2.051402e-21
>> 11698 -16.87556 7.510317 1.779005e-24 3.673842e-21
>> 17562 -31.98794 -36.056237 2.586138e-23 4.806596e-20
>>
>> If you really do need the glm functionality, because you want to do
>> something other than pairwise comparisons, then we will have to make
>> the glm code work for you. We will probably never succeed in writing
>> glm fitting code that converges for every possible data set, yet for
>> your data we can make the functions work if (i) we recognise that the
>> design is a oneway layout or (ii) use better starting values. If you
>> need this, let us know and we will send you some newer code.
>>
>> Best wishes
>> Gordon
>>
>>
>>> Date: Sat, 26 Feb 2011 15:08:28 -0500
>>> From: Colin Maxwell <csm29 at duke.edu>
>>> To: bioconductor at r-project.org
>>> Subject: [BioC] edgeR glm fit error
>>>
>>> Hello,
>>> I'm trying to use edgeR to analyze some RNA-seq time series data. I have
>>> four time points. The first and last time points have three replicates,
>>> while the middle two have two replicates. The following gives the error I
>>> get:
>>>
>>> require(edgeR)
>>> counts <- read.csv("http://www.duke.edu/~csm29/counts.csv",row.names=1)<
>>> d <- DGEList(counts)
>>> d <- calcNormFactors(d)
>>> d <- d[rowSums(d$counts)>9,]
>>> times <- rep(c("zero","one","three","six"),c(3,2,2,3))
>>> design <- model.matrix(~factor(times))
>>> d <- estimateCRDisp(d, design)
>>>
>>> Error in beta[k, ] <- betaj[decr, ] :
>>> NAs are not allowed in subscripted assignments
>>>
>>> traceback()
>>> 3: mglmLS(y, design, dispersion, offset = offset)
>>> 2: adjustedProfileLik(spline.disp[i], y.filt, design = design, offset =
>>> offset.mat.filt +
>>> lib.size.mat.filt)
>>> 1: estimateCRDisp(d, design)
>>>
>>> sessionInfo()
>>> R version 2.12.0 (2010-10-15)
>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] edgeR_2.0.3
>>>
>>> loaded via a namespace (and not attached):
>>> [1] limma_3.6.9
>>>
>>> The function seems to be getting hung up on one or two genes when I recover
>>> the error. However, when I remove those genes from the data, the problem is
>>> still there. Any help would be much appreciated!
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list