[BioC] Gene expression analysis with edgeR with a large, nested design matrix
Ulrich Braunschweig
u.braunschweig at utoronto.ca
Tue May 27 00:34:17 CEST 2014
Dear Gordon,
I tried both (on a reduced design matrix), with quite similar results.
The quasi-likelihood approach seems a little more conservative at least
with borderline significant genes, but since limma/voom is still orders
of magnitude faster, that's what I will use.
[I presume in the quasi-likelihood pipeline, the coefficients have to be
specified in glmQLFTest() rather than topTags()]
Thanks very much for your help!
Uli
On 24/05/14 20:57, Gordon K Smyth wrote:
> Dear Uli,
>
> edgeR is probably the fastest of the glm negative binomial packages,
> as we have done a lot of work moving all the fitting code to the C++
> level. Nevertheless, the ususal glm pipeline will become very slow
> when there are over 3000 samples and the design matrix has 1500 columns.
>
> Here are two other possibilities. First you could switch to voom
> instead. I ran voom on your simulated data example in a few minutes on
> my laptop computer:
>
> y <- DGEList(counts = reads)
> y <- calcNormFactors(y)
> v <- voom(y,expdesign,plot=TRUE)
> fit <- lmFit(v,expdesign)
> fit <- eBayes(fit)
> topTable(fit, coef=5)
>
> and so on. Alternatively, you would stick to edgeR and use the
> quasi-likelihood pipeline:
>
> y <- DGEList(counts = reads)
> y <- calcNormFactors(y)
> y <- estimateGLMCommonDisp(y, design=expdesign, method="Pearson")
> ql <- glmQLFTest(y,expdesign)
> topTags(ql,coef=5)
>
> etc. The glmQLFTest() function will take more time than voom but less
> time than estimateGLMCcommonDispersion on your data. The common
> dispersion step is optional in this pipeline -- I include it because
> you've already done it.
>
> Either of these approaches have good statistical motivations. See the
> help pages for voom and glmQLFTest for references to published papers
> describing them.
>
> Best wishes
> Gordon
>
>> Date: Fri, 23 May 2014 09:06:21 -0700 (PDT)
>> From: "Uli Braunschweig [guest]" <guest at bioconductor.org>
>> To: bioconductor at r-project.org, u.braunschweig at utoronto.ca
>> Subject: [BioC] Gene expression analysis with edgeR with a large,
>> nested design matrix
>>
>> Hi All,
>>
>> My problem is the following:
>> I have expression counts for 50 genes in an RNAi screen with 1,536
>> treatments (which includes positive and negative controls, so really
>> 1,416 unique treatments) in two replicates, done in 96-well format
>> (2x16 plates). I know that plate effects and edge effects (whether a
>> well was located on the edge of a plate) are significant, so the
>> design should include treatment, plate, and location (edge or
>> interior). Locations are identical between replicates. Each plate has
>> two negative controls ("siNT"), as well as other controls.
>>
>> I am only interested in the contrasts of each of the treatments vs.
>> the "siNT" control. I thought that the model of edgeR would be useful
>> to score significant hits while at the same time dealing with the
>> mentioned technical biases in a meaningful manner. However, I've had
>> to kill the analysis because estimating the trended and tag-wise
>> dispersions takes excessively long.
>>
>> My question is: Is it even feasible to try to adress a problem with a
>> design that has so few genes and so many treatments using edgeR (or
>> DESeq2)?
>>
>> library(edgeR)
>>
>> ## Data look similar to this:
>> reads <- matrix(round(2000 * rexp(50 * 3072)), nrow=50) # dense
>> matrix of 50 genes x (1536 treatments in duplicate)
>>
>> ## Here is how I create my design factors:
>> # (I've left out 'replicate' because it is a linear combination of
>> plates 1-16 and 17-32)
>> rows <- rep(rep(1:8, each=12), 32)
>> cols <- rep(rep(1:12, 8), 32)
>> plate <- rep(1:32, each=96)
>>
>> type.nt <- rows == 1 & cols == 1 | # the negative control to
>> compare everything to; 2 per plate
>> rows == 4 & cols == 9
>> type.posCtl <- rows == 4 & cols == 3 | # positive control; 2 per
>> plate
>> rows == 5 & cols == 9
>> type.mock <- rows == 3 & cols == 3 | # another control; 2 per plate
>> rows == 8 & cols == 12
>> type.empty <- plate %in% c(16, 32) & ( # another control; a bunch
>> on only 2 plates
>> rows %in% c(2:3,6:8) & cols == 9 |
>> cols %in% c(10,11) |
>> rows %in% 1:7 & cols == 12
>> )
>> type.edge <- rows %in% c(1,8) | cols %in% c(1,12) # position on the
>> plate
>>
>> treat <- rep(paste("T", 1:1536, sep=""), 2) # treatments
>> treat[type.nt] <- "siNT"
>> treat[type.mock] <- "mock"
>> treat[type.empty] <- "empty"
>> treat[type.posCtl] <- "siPosCtl"
>> treatfac <- relevel(as.factor(treat), ref="siNT")
>>
>> edgefac <- as.factor(ifelse(type.edge, yes="edge", no="interior"))
>> platefac <- as.factor(paste("P", plate, sep=""))
>>
>> expfact <- data.frame(treatment = treatfac,
>> platepos = edgefac,
>> plate = platefac
>> )
>>
>> expdesign <- model.matrix(formula(~ treatment + plate + platepos),
>> data=expfact)
>>
>> ## Estimating the dispersions
>> y <- DGEList(counts = reads) # reads is the 50x3072 matrix
>> y <- calcNormFactors(y)
>> y <- estimateGLMCommonDisp(y, design=expdesign, method="Pearson",
>> verbose=TRUE) # faster than Cox-Reid and probably ok since there are
>> many treatments
>> y <- estimateGLMTrendedDisp(y, design=expdesign)
>> y <- estimateGLMTagwiseDisp(y, design=expdesign)
>> ## Neither of the last two steps finish running in a day; same for
>> estimateGLMCommonDisp() if method="CoxReid"
>>
>> I was then hoping to extract the contrasts of each treatment against
>> the "siNT" control.
>> Would it make sense to combine the two technical factors, or subset
>> the count and design matrices for each treatment in a way that
>> reduces the number of treatments, and run them separately?
>> Alternatively, I thought of doing the analysis separately for each
>> treatment using the whole count matrix but amalgamating all other
>> non-control treatments in an "other" group". This seems feasible when
>> run in parallel, but it would be overkill...
>> Any suggestions on how to proceed?
>>
>> Kind regards,
>> Uli
>>
>> --
>> Ulrich Braunschweig, PhD
>>
>> The Donnelly Centre
>> University of Toronto
>> 160 College Street, Room 1030
>> Toronto, Ontario
>> Canada M5S 3E1
>>
>> u.braunschweig at utoronto.ca
>>
>> -- output of sessionInfo():
>>
>> R version 3.1.0 (2014-04-10)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_IE.UTF-8 LC_COLLATE=en_CA.UTF-8
>> [5] LC_MONETARY=en_IE.UTF-8 LC_MESSAGES=en_CA.UTF-8
>> [7] LC_PAPER=en_IE.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_IE.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] edgeR_3.6.2 limma_3.20.4
>
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:6}}
More information about the Bioconductor
mailing list