[BioC] Gene expression analysis with edgeR with a large, nested design matrix

Gordon K Smyth smyth at wehi.EDU.AU
Tue May 27 01:35:54 CEST 2014


On Mon, 26 May 2014, Ulrich Braunschweig wrote:

> Dear Gordon,
>
> I tried both (on a reduced design matrix), with quite similar results. The 
> quasi-likelihood approach seems a little more conservative at least with 
> borderline significant genes, but since limma/voom is still orders of 
> magnitude faster, that's what I will use.
>
> [I presume in the quasi-likelihood pipeline, the coefficients have to be 
> specified in glmQLFTest() rather than topTags()]

Yes, that's right.

Gordon

> Thanks very much for your help!
> Uli
>
> On 24/05/14 20:57, Gordon K Smyth wrote:
>> Dear Uli,
>> 
>> edgeR is probably the fastest of the glm negative binomial packages, as we 
>> have done a lot of work moving all the fitting code to the C++ level. 
>> Nevertheless, the ususal glm pipeline will become very slow when there are 
>> over 3000 samples and the design matrix has 1500 columns.
>> 
>> Here are two other possibilities.  First you could switch to voom instead. 
>> I ran voom on your simulated data example in a few minutes on my laptop 
>> computer:
>>
>>   y <- DGEList(counts = reads)
>>   y <- calcNormFactors(y)
>>   v <- voom(y,expdesign,plot=TRUE)
>>   fit <- lmFit(v,expdesign)
>>   fit <- eBayes(fit)
>>   topTable(fit, coef=5)
>> 
>> and so on.  Alternatively, you would stick to edgeR and use the 
>> quasi-likelihood pipeline:
>>
>>   y <- DGEList(counts = reads)
>>   y <- calcNormFactors(y)
>>   y <- estimateGLMCommonDisp(y, design=expdesign, method="Pearson")
>>   ql <- glmQLFTest(y,expdesign)
>>   topTags(ql,coef=5)
>> 
>> etc.  The glmQLFTest() function will take more time than voom but less time 
>> than estimateGLMCcommonDispersion on your data.  The common dispersion step 
>> is optional in this pipeline -- I include it because you've already done 
>> it.
>> 
>> Either of these approaches have good statistical motivations.  See the help 
>> pages for voom and glmQLFTest for references to published papers describing 
>> them.
>> 
>> Best wishes
>> Gordon
>> 
>>> Date: Fri, 23 May 2014 09:06:21 -0700 (PDT)
>>> From: "Uli Braunschweig [guest]" <guest at bioconductor.org>
>>> To: bioconductor at r-project.org, u.braunschweig at utoronto.ca
>>> Subject: [BioC] Gene expression analysis with edgeR with a large,
>>>     nested design matrix
>>> 
>>> Hi All,
>>> 
>>> My problem is the following:
>>> I have expression counts for 50 genes in an RNAi screen with 1,536 
>>> treatments (which includes positive and negative controls, so really 1,416 
>>> unique treatments) in two replicates, done in 96-well format (2x16 
>>> plates). I know that plate effects and edge effects (whether a well was 
>>> located on the edge of a plate) are significant, so the design should 
>>> include treatment, plate, and location (edge or interior). Locations are 
>>> identical between replicates. Each plate has two negative controls 
>>> ("siNT"), as well as other controls.
>>> 
>>> I am only interested in the contrasts of each of the treatments vs. the 
>>> "siNT" control. I thought that the model of edgeR would be useful to score 
>>> significant hits while at the same time dealing with the mentioned 
>>> technical biases in a meaningful manner. However, I've had to kill the 
>>> analysis because estimating the trended and tag-wise dispersions takes 
>>> excessively long.
>>> 
>>> My question is: Is it even feasible to try to adress a problem with a 
>>> design that has so few genes and so many treatments using edgeR (or 
>>> DESeq2)?
>>> 
>>> library(edgeR)
>>> 
>>> ## Data look similar to this:
>>> reads <- matrix(round(2000 * rexp(50 * 3072)), nrow=50)  # dense matrix of 
>>> 50 genes x (1536 treatments in duplicate)
>>> 
>>> ## Here is how I create my design factors:
>>> # (I've left out 'replicate' because it is a linear combination of plates 
>>> 1-16 and 17-32)
>>> rows  <- rep(rep(1:8, each=12), 32)
>>> cols  <- rep(rep(1:12, 8), 32)
>>> plate <- rep(1:32, each=96)
>>> 
>>> type.nt <-      rows == 1 & cols == 1 |   # the negative control to 
>>> compare everything to; 2 per plate
>>>                rows == 4 & cols == 9
>>> type.posCtl <-  rows == 4 & cols == 3 |   # positive control; 2 per plate
>>>                rows == 5 & cols == 9
>>> type.mock <-    rows == 3 & cols == 3 |   # another control; 2 per plate
>>>                rows == 8 & cols == 12
>>> type.empty <-   plate %in% c(16, 32) & (  # another control; a bunch on 
>>> only 2 plates
>>>                rows %in% c(2:3,6:8) & cols == 9 |
>>>                cols %in% c(10,11) |
>>>                rows %in% 1:7 & cols == 12
>>>                )
>>> type.edge <- rows %in% c(1,8) | cols %in% c(1,12)  # position on the plate
>>> 
>>> treat <- rep(paste("T", 1:1536, sep=""), 2)  # treatments
>>> treat[type.nt]     <- "siNT"
>>> treat[type.mock]   <- "mock"
>>> treat[type.empty]  <- "empty"
>>> treat[type.posCtl] <- "siPosCtl"
>>> treatfac <- relevel(as.factor(treat), ref="siNT")
>>> 
>>> edgefac  <- as.factor(ifelse(type.edge, yes="edge", no="interior"))
>>> platefac <- as.factor(paste("P", plate, sep=""))
>>> 
>>> expfact <- data.frame(treatment = treatfac,
>>>                      platepos  = edgefac,
>>>                      plate     = platefac
>>>                      )
>>> 
>>> expdesign <- model.matrix(formula(~ treatment + plate + platepos), 
>>> data=expfact)
>>> 
>>> ## Estimating the dispersions
>>> y <- DGEList(counts = reads)  # reads is the 50x3072 matrix
>>> y <- calcNormFactors(y)
>>> y <- estimateGLMCommonDisp(y, design=expdesign, method="Pearson", 
>>> verbose=TRUE)  # faster than Cox-Reid and probably ok since there are many 
>>> treatments
>>> y <- estimateGLMTrendedDisp(y, design=expdesign)
>>> y <- estimateGLMTagwiseDisp(y, design=expdesign)
>>> ## Neither of the last two steps finish running in a day; same for 
>>> estimateGLMCommonDisp() if method="CoxReid"
>>> 
>>> I was then hoping to extract the contrasts of each treatment against the 
>>> "siNT" control.
>>> Would it make sense to combine the two technical factors, or subset the 
>>> count and design matrices for each treatment in a way that reduces the 
>>> number of treatments, and run them separately? Alternatively, I thought of 
>>> doing the analysis separately for each treatment using the whole count 
>>> matrix but amalgamating all other non-control treatments in an "other" 
>>> group". This seems feasible when run in parallel, but it would be 
>>> overkill...
>>> Any suggestions on how to proceed?
>>> 
>>> Kind regards,
>>> Uli
>>> 
>>> -- 
>>> Ulrich Braunschweig, PhD
>>> 
>>> The Donnelly Centre
>>> University of Toronto
>>> 160 College Street, Room 1030
>>> Toronto, Ontario
>>> Canada M5S 3E1
>>> 
>>> u.braunschweig at utoronto.ca
>>> 
>>> -- output of sessionInfo():
>>> 
>>> R version 3.1.0 (2014-04-10)
>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>> 
>>> locale:
>>> [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
>>> [3] LC_TIME=en_IE.UTF-8        LC_COLLATE=en_CA.UTF-8
>>> [5] LC_MONETARY=en_IE.UTF-8    LC_MESSAGES=en_CA.UTF-8
>>> [7] LC_PAPER=en_IE.UTF-8       LC_NAME=C
>>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_IE.UTF-8 LC_IDENTIFICATION=C
>>> 
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods base
>>> 
>>> other attached packages:
>>> [1] edgeR_3.6.2  limma_3.20.4

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list