[BioC] Gene expression analysis with edgeR with a large, nested design matrix
Gordon K Smyth
smyth at wehi.EDU.AU
Tue May 27 01:35:54 CEST 2014
On Mon, 26 May 2014, Ulrich Braunschweig wrote:
> Dear Gordon,
>
> I tried both (on a reduced design matrix), with quite similar results. The
> quasi-likelihood approach seems a little more conservative at least with
> borderline significant genes, but since limma/voom is still orders of
> magnitude faster, that's what I will use.
>
> [I presume in the quasi-likelihood pipeline, the coefficients have to be
> specified in glmQLFTest() rather than topTags()]
Yes, that's right.
Gordon
> Thanks very much for your help!
> Uli
>
> On 24/05/14 20:57, Gordon K Smyth wrote:
>> Dear Uli,
>>
>> edgeR is probably the fastest of the glm negative binomial packages, as we
>> have done a lot of work moving all the fitting code to the C++ level.
>> Nevertheless, the ususal glm pipeline will become very slow when there are
>> over 3000 samples and the design matrix has 1500 columns.
>>
>> Here are two other possibilities. First you could switch to voom instead.
>> I ran voom on your simulated data example in a few minutes on my laptop
>> computer:
>>
>> y <- DGEList(counts = reads)
>> y <- calcNormFactors(y)
>> v <- voom(y,expdesign,plot=TRUE)
>> fit <- lmFit(v,expdesign)
>> fit <- eBayes(fit)
>> topTable(fit, coef=5)
>>
>> and so on. Alternatively, you would stick to edgeR and use the
>> quasi-likelihood pipeline:
>>
>> y <- DGEList(counts = reads)
>> y <- calcNormFactors(y)
>> y <- estimateGLMCommonDisp(y, design=expdesign, method="Pearson")
>> ql <- glmQLFTest(y,expdesign)
>> topTags(ql,coef=5)
>>
>> etc. The glmQLFTest() function will take more time than voom but less time
>> than estimateGLMCcommonDispersion on your data. The common dispersion step
>> is optional in this pipeline -- I include it because you've already done
>> it.
>>
>> Either of these approaches have good statistical motivations. See the help
>> pages for voom and glmQLFTest for references to published papers describing
>> them.
>>
>> Best wishes
>> Gordon
>>
>>> Date: Fri, 23 May 2014 09:06:21 -0700 (PDT)
>>> From: "Uli Braunschweig [guest]" <guest at bioconductor.org>
>>> To: bioconductor at r-project.org, u.braunschweig at utoronto.ca
>>> Subject: [BioC] Gene expression analysis with edgeR with a large,
>>> nested design matrix
>>>
>>> Hi All,
>>>
>>> My problem is the following:
>>> I have expression counts for 50 genes in an RNAi screen with 1,536
>>> treatments (which includes positive and negative controls, so really 1,416
>>> unique treatments) in two replicates, done in 96-well format (2x16
>>> plates). I know that plate effects and edge effects (whether a well was
>>> located on the edge of a plate) are significant, so the design should
>>> include treatment, plate, and location (edge or interior). Locations are
>>> identical between replicates. Each plate has two negative controls
>>> ("siNT"), as well as other controls.
>>>
>>> I am only interested in the contrasts of each of the treatments vs. the
>>> "siNT" control. I thought that the model of edgeR would be useful to score
>>> significant hits while at the same time dealing with the mentioned
>>> technical biases in a meaningful manner. However, I've had to kill the
>>> analysis because estimating the trended and tag-wise dispersions takes
>>> excessively long.
>>>
>>> My question is: Is it even feasible to try to adress a problem with a
>>> design that has so few genes and so many treatments using edgeR (or
>>> DESeq2)?
>>>
>>> library(edgeR)
>>>
>>> ## Data look similar to this:
>>> reads <- matrix(round(2000 * rexp(50 * 3072)), nrow=50) # dense matrix of
>>> 50 genes x (1536 treatments in duplicate)
>>>
>>> ## Here is how I create my design factors:
>>> # (I've left out 'replicate' because it is a linear combination of plates
>>> 1-16 and 17-32)
>>> rows <- rep(rep(1:8, each=12), 32)
>>> cols <- rep(rep(1:12, 8), 32)
>>> plate <- rep(1:32, each=96)
>>>
>>> type.nt <- rows == 1 & cols == 1 | # the negative control to
>>> compare everything to; 2 per plate
>>> rows == 4 & cols == 9
>>> type.posCtl <- rows == 4 & cols == 3 | # positive control; 2 per plate
>>> rows == 5 & cols == 9
>>> type.mock <- rows == 3 & cols == 3 | # another control; 2 per plate
>>> rows == 8 & cols == 12
>>> type.empty <- plate %in% c(16, 32) & ( # another control; a bunch on
>>> only 2 plates
>>> rows %in% c(2:3,6:8) & cols == 9 |
>>> cols %in% c(10,11) |
>>> rows %in% 1:7 & cols == 12
>>> )
>>> type.edge <- rows %in% c(1,8) | cols %in% c(1,12) # position on the plate
>>>
>>> treat <- rep(paste("T", 1:1536, sep=""), 2) # treatments
>>> treat[type.nt] <- "siNT"
>>> treat[type.mock] <- "mock"
>>> treat[type.empty] <- "empty"
>>> treat[type.posCtl] <- "siPosCtl"
>>> treatfac <- relevel(as.factor(treat), ref="siNT")
>>>
>>> edgefac <- as.factor(ifelse(type.edge, yes="edge", no="interior"))
>>> platefac <- as.factor(paste("P", plate, sep=""))
>>>
>>> expfact <- data.frame(treatment = treatfac,
>>> platepos = edgefac,
>>> plate = platefac
>>> )
>>>
>>> expdesign <- model.matrix(formula(~ treatment + plate + platepos),
>>> data=expfact)
>>>
>>> ## Estimating the dispersions
>>> y <- DGEList(counts = reads) # reads is the 50x3072 matrix
>>> y <- calcNormFactors(y)
>>> y <- estimateGLMCommonDisp(y, design=expdesign, method="Pearson",
>>> verbose=TRUE) # faster than Cox-Reid and probably ok since there are many
>>> treatments
>>> y <- estimateGLMTrendedDisp(y, design=expdesign)
>>> y <- estimateGLMTagwiseDisp(y, design=expdesign)
>>> ## Neither of the last two steps finish running in a day; same for
>>> estimateGLMCommonDisp() if method="CoxReid"
>>>
>>> I was then hoping to extract the contrasts of each treatment against the
>>> "siNT" control.
>>> Would it make sense to combine the two technical factors, or subset the
>>> count and design matrices for each treatment in a way that reduces the
>>> number of treatments, and run them separately? Alternatively, I thought of
>>> doing the analysis separately for each treatment using the whole count
>>> matrix but amalgamating all other non-control treatments in an "other"
>>> group". This seems feasible when run in parallel, but it would be
>>> overkill...
>>> Any suggestions on how to proceed?
>>>
>>> Kind regards,
>>> Uli
>>>
>>> --
>>> Ulrich Braunschweig, PhD
>>>
>>> The Donnelly Centre
>>> University of Toronto
>>> 160 College Street, Room 1030
>>> Toronto, Ontario
>>> Canada M5S 3E1
>>>
>>> u.braunschweig at utoronto.ca
>>>
>>> -- output of sessionInfo():
>>>
>>> R version 3.1.0 (2014-04-10)
>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
>>> [3] LC_TIME=en_IE.UTF-8 LC_COLLATE=en_CA.UTF-8
>>> [5] LC_MONETARY=en_IE.UTF-8 LC_MESSAGES=en_CA.UTF-8
>>> [7] LC_PAPER=en_IE.UTF-8 LC_NAME=C
>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_IE.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] edgeR_3.6.2 limma_3.20.4
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list