[BioC] DESeq: GLM for multi-factor and mulit-level designs

Mon Nov 26 22:15:28 CET 2012

If you are more familiar with edgeR, I have written a version of edgeR's 
glmFit that takes a DESeq CountDataSet and returns an edgeR DGEGLM 
object. Then you can use edgeR's GLM functionality, which directly 
supports the kind of comparisions you are seeking.

I have tested these functions and found that they produce identical 
results to DESeq's GLM methods when running the same test on the same 
CountDataSet object, so I believe that they will also produce valid 
results for tests that DESeq's methods do not directly support.

If people find these useful, I'd like to get them included in 
Bioconductor, but I'm not sure which package they belong in. Suggestions 
welcome.

## The following two functions allow the use of edgeR's infrastructure
## to execute the DESeq method. Instead of glmFit(dge, design), use
## glmFit.CountDataSet(cds, design), then continue as with a normal
## edgeR analysis.
getOffset.CountDataSet <- function(y) {
   if (any(is.na(sizeFactors(y))))
     stop("Call estimateDispersions first")
   log(sizeFactors(y)) - mean(log(sizeFactors(y) / colSums(counts(y))))
}

glmFit.CountDataSet <- function (y, design = NULL, dispersion = NULL, 
offset = NULL,
                                  weights = NULL, lib.size = NULL, start 
= NULL, method = "auto",
                                  ...)
{
   stopifnot(is(y, "CountDataSet"))
   if (is.null(dispersion)) {
     if ("disp_pooled" %in% colnames(fData(y)))
       dispersion <- fData(y)$disp_pooled
     else if ("disp_blind" %in% colnames(fData(y))) {
       if (fitInfo(y, "blind")$sharingMode != "fit-only")
         warning("You have used 'method=\"blind\"' in estimateDispersion 
without also setting 'sharingMode=\"fit-only\"'. This will not yield 
useful results.")
       dispersion <- fData(y)$disp_blind
     }
     else stop("Call 'estimateDispersions' with 'method=\"pooled\"' (or 
'blind') first.")
   }
   if (is.null(offset) && is.null(lib.size))
     offset <- getOffset.CountDataSet(y)
   ## UGLY HACK: DESeq can produce infinite dispersion estimates, which
   ## cause errors in glmFit. To fix this, we replace infinite
   ## dispersion estimates with the maximum representable floating
   ## point value, which should always result in a PValue of 1.
   infdisp <- !is.finite(dispersion)
   dispersion[infdisp] <- .Machine$double.xmax
   fit <- glmFit.default(y = counts(y), design = design, dispersion = 
dispersion,
                         offset = offset, weights = weights, lib.size = 
lib.size,
                         start = start, method = method, ...)
   ## Now set the dispersions back to infinite in the resulting fit object.
   fit$dispersion[infdisp] <- +Inf
   dimnames(fit$coefficients) <- list(rownames(counts(y)), colnames(design))
   fit$counts <- counts(y)
   fit$samples <- pData(y)[-1]
   fit$genes <- fData(y)[setdiff(names(fData(y)), c("disp_blind", 
"disp_pooled"))]
   new("DGEGLM", fit)
}

On 11/26/2012 08:34 AM, Dorota Herman wrote:
> Hello everyone,
>
> I am currently working on GLM for RNA-seq data in DESeq for a design 
> with two factors and three factor levels, such as:
>
> type experiments
> A_1 A exper1
> A_2 A exper2
> A_3 A exper3
> B_1 B exper2
> B_2 B exper3
> B_3 B exper4
> C_1 C exper4
> C_2 C exper2
> C_3 C exper1
>
> In the recent example of the GLM application there is a fixme note:
>
> “Can we add a paragraph on what to do if we only want to make specific 
> comparisons, e.g. let fac be a factor with three levels A, B and C, 
> and we want to test pairwise for differences between A and B (without 
> C), between A and C (without B) etc.” .... and between B and C 
> (without A)
>
> Is it possible to conduct such a analysis with the current DESeq version?
>
> Thanks a lot
> Dorota
>