[Bioc-devel] Utility functions and bug fixes for edgeR/DESeq

Tue Nov 20 03:08:04 CET 2012

In my work, I've developed a few useful functions relating to edgeR and 
DESeq. First, I wrote a version of glmQLFTest that does not throw errors 
on zero-count genes. See the comment for details on what exactly it does:

## This version of glmQLFTest excludes genes with zero counts in all
## samples (logCPM == -Inf), because these cause squeezeVar to throw
## errors. Instead, it just arbitrarily sets their F statistics to
## zero and P-Values to 1, which is appropriate because obviously we
## cannot call differential expression if all counts are zero.
glmQLFTest.safe <- function (glmfit, coef = ncol(glmfit$design), 
contrast = NULL,
                              abundance.trend = TRUE)
{
   present <- is.finite(glmfit$abundance) & is.finite(glmfit$dispersion)
   out <- glmLRT(glmfit, coef = coef, contrast = contrast)
   df.residual <- glmfit$df.residual
   s2 <- glmfit$deviance/df.residual
   df.residual[s2 < 1e-14] <- 0
   s2 <- pmax(s2, 0)
   if (abundance.trend)  {
     s2.fit <- squeezeVar(s2[present],
                          df = df.residual[present],
                          covariate = glmfit$abundance[present])
   }
   else {
     s2.fit <- squeezeVar(s2, df = df.residual)
   }
   F <- rep(0, length(s2))
   F[present] <- (out$table$LR/out$df.test)[present]/s2.fit$var.post
   df.total <- s2.fit$df.prior + df.residual
   max.df.residual <- ncol(glmfit$counts) - ncol(glmfit$design)
   df.total <- min(df.total, length(s2) * max.df.residual)
   F.pvalue <- pf(F, df1 = out$df.test, df2 = df.total, lower.tail = FALSE,
                  log.p = FALSE)
   i <- s2.fit$var.post < 1
   if (any(i)) {
     chisq.pvalue <- pchisq(out$table$LR[i], df = out$df.test[i],
                            lower.tail = FALSE, log.p = FALSE)
     F.pvalue[i] <- pmax(F.pvalue[i], chisq.pvalue)
   }
   out$table$LR <- out$table$PValue <- NULL
   out$table$F <- F
   out$table$PValue <- F.pvalue
   out$s2.fit <- s2.fit
   out$df.prior <- s2.fit$df.prior
   out$df.total <- df.total
   out
}

Secondly, I wanted to use edgeR's glmLRT and topTags function to 
generate nice tables for my DESeq results, so I wrote a few functions 
that allow me to use edgeR's glmFit function on a DESeq CountDataSet to 
yield a DGEGLM object just as if glmFit was called on a DGEList object. 
I have verified that these functions produce nearly identical fit 
coefficients to DESeq's own fitNbinomGLMs, and the resulting p-values 
from glmLRT are nearly identical to DESeq's nbinomGLMTest, so this 
accurately reproduces the DESeq method using edgeR functions. This 
allows you to use edgeR's topTags to get a nice table with PValues, FDR, 
logFC, logCPM, and annotations. I also tried glmQLFTest on the results, 
but it didn't change any of the p-values because the F-test derived 
p-values were all smaller so it just kept the chi-squared-derived 
p-values, so I'm not sure if glmQLFTest is valid with DESeq. Anyway, 
here are the functions:

## The following two functions allow the use to edgeR's infrastructure
## to execute the DESeq method. Instead of glmFit(dge, design), use
## glmFit.CountDataSet(cds, design), then continue as with a normal
## edgeR analysis.
getOffset.CountDataSet <- function(y) {
   if (any(is.na(sizeFactors(y))))
     stop("Call estimateDispersions first")
   log(sizeFactors(y)) - mean(log(sizeFactors(y) / colSums(counts(y))))
}

glmFit.CountDataSet <- function (y, design = NULL, dispersion = NULL, 
offset = NULL,
                                  weights = NULL, lib.size = NULL, start 
= NULL, method = "auto",
                                  ...)
{
   stopifnot(is(y, "CountDataSet"))
   if (is.null(dispersion)) {
     if ("disp_pooled" %in% colnames(fData(y)))
       dispersion <- fData(y)$disp_pooled
     else if ("disp_blind" %in% colnames(fData(y))) {
       if (fitInfo(y, "blind")$sharingMode != "fit-only")
         warning("You have used 'method=\"blind\"' in estimateDispersion 
without also setting 'sharingMode=\"fit-only\"'. This will not yield 
useful results.")
       dispersion <- fData(y)$disp_blind
     }
     else stop("Call 'estimateDispersions' with 'method=\"pooled\"' (or 
'blind') first.")
   }
   if (is.null(offset) && is.null(lib.size))
     offset <- getOffset.CountDataSet(y)
   ## UGLY HACK: DESeq can produce infinite dispersion estimates, which
   ## cause errors in glmFit. To fix this, we replace infinite
   ## dispersion estimates with the maximum representable floating
   ## point value, which should always result in a PValue of 1.
   infdisp <- !is.finite(dispersion)
   dispersion[infdisp] <- .Machine$double.xmax
   fit <- glmFit.default(y = counts(y), design = design, dispersion = 
dispersion,
                         offset = offset, weights = weights, lib.size = 
lib.size,
                         start = start, method = method, ...)
   ## Now set the dispersions back to infinite in the resulting fit object.
   fit$dispersion[infdisp] <- +Inf
   dimnames(fit$coefficients) <- list(rownames(counts(y)), colnames(design))
   fit$counts <- counts(y)
   fit$samples <- pData(y)[-1]
   fit$genes <- fData(y)[setdiff(names(fData(y)), c("disp_blind", 
"disp_pooled"))]
   new("DGEGLM", fit)
}

Note that while I have named the functions like S3 method, I haven't 
actually tested whether method dispatch works properly with them or not, 
since DESeq uses S4 classes. In my code, I call them by their full names 
instead of relying on method dispatch.

Please let me know if these seem useful to anyone. I've found them 
useful, because now I can produce edgeR and DESeq results in exactly the 
same format (via topTags).

Regards,

-Ryan Thompson