[BioC] Gene expression analysis with edgeR with a large, nested design matrix

Fri May 23 18:06:21 CEST 2014

Hi All,

My problem is the following:
I have expression counts for 50 genes in an RNAi screen with 1,536 treatments (which includes positive and negative controls, so really 1,416 unique treatments) in two replicates, done in 96-well format (2x16 plates). I know that plate effects and edge effects (whether a well was located on the edge of a plate) are significant, so the design should include treatment, plate, and location (edge or interior). Locations are identical between replicates. Each plate has two negative controls ("siNT"), as well as other controls.

I am only interested in the contrasts of each of the treatments vs. the "siNT" control. I thought that the model of edgeR would be useful to score significant hits while at the same time dealing with the mentioned technical biases in a meaningful manner. However, I've had to kill the analysis because estimating the trended and tag-wise dispersions takes excessively long.

My question is: Is it even feasible to try to adress a problem with a design that has so few genes and so many treatments using edgeR (or DESeq2)?

library(edgeR)

## Data look similar to this:
reads <- matrix(round(2000 * rexp(50 * 3072)), nrow=50)  # dense matrix of 50 genes x (1536 treatments in duplicate)

## Here is how I create my design factors:
# (I've left out 'replicate' because it is a linear combination of plates 1-16 and 17-32)
rows  <- rep(rep(1:8, each=12), 32)
cols  <- rep(rep(1:12, 8), 32)
plate <- rep(1:32, each=96)

type.nt <-      rows == 1 & cols == 1 |   # the negative control to compare everything to; 2 per plate
                rows == 4 & cols == 9
type.posCtl <-  rows == 4 & cols == 3 |   # positive control; 2 per plate
                rows == 5 & cols == 9
type.mock <-    rows == 3 & cols == 3 |   # another control; 2 per plate
                rows == 8 & cols == 12
type.empty <-   plate %in% c(16, 32) & (  # another control; a bunch on only 2 plates
                rows %in% c(2:3,6:8) & cols == 9 |
                cols %in% c(10,11) |
                rows %in% 1:7 & cols == 12
                )
type.edge <- rows %in% c(1,8) | cols %in% c(1,12)  # position on the plate

treat <- rep(paste("T", 1:1536, sep=""), 2)  # treatments
treat[type.nt]     <- "siNT"
treat[type.mock]   <- "mock"
treat[type.empty]  <- "empty"
treat[type.posCtl] <- "siPosCtl" 
treatfac <- relevel(as.factor(treat), ref="siNT")

edgefac  <- as.factor(ifelse(type.edge, yes="edge", no="interior"))
platefac <- as.factor(paste("P", plate, sep=""))

expfact <- data.frame(treatment = treatfac,
                      platepos  = edgefac,
                      plate     = platefac 
                      )

expdesign <- model.matrix(formula(~ treatment + plate + platepos), data=expfact)

## Estimating the dispersions
y <- DGEList(counts = reads)  # reads is the 50x3072 matrix
y <- calcNormFactors(y)
y <- estimateGLMCommonDisp(y, design=expdesign, method="Pearson", verbose=TRUE)  # faster than Cox-Reid and probably ok since there are many treatments
y <- estimateGLMTrendedDisp(y, design=expdesign)
y <- estimateGLMTagwiseDisp(y, design=expdesign)
## Neither of the last two steps finish running in a day; same for estimateGLMCommonDisp() if method="CoxReid"

I was then hoping to extract the contrasts of each treatment against the "siNT" control.
Would it make sense to combine the two technical factors, or subset the count and design matrices for each treatment in a way that reduces the number of treatments, and run them separately? Alternatively, I thought of doing the analysis separately for each treatment using the whole count matrix but amalgamating all other non-control treatments in an "other" group". This seems feasible when run in parallel, but it would be overkill...
Any suggestions on how to proceed?

Kind regards,
Uli

-- 
Ulrich Braunschweig, PhD

The Donnelly Centre
University of Toronto
160 College Street, Room 1030
Toronto, Ontario
Canada M5S 3E1

u.braunschweig at utoronto.ca 

 -- output of sessionInfo(): 

R version 3.1.0 (2014-04-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_IE.UTF-8        LC_COLLATE=en_CA.UTF-8    
 [5] LC_MONETARY=en_IE.UTF-8    LC_MESSAGES=en_CA.UTF-8   
 [7] LC_PAPER=en_IE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_IE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_3.6.2  limma_3.20.4

--
Sent via the guest posting facility at bioconductor.org.