[Bioc-devel] Utility functions and bug fixes for edgeR/DESeq
Ryan C. Thompson
rct at thompsonclan.org
Tue Nov 20 03:08:04 CET 2012
In my work, I've developed a few useful functions relating to edgeR and
DESeq. First, I wrote a version of glmQLFTest that does not throw errors
on zero-count genes. See the comment for details on what exactly it does:
## This version of glmQLFTest excludes genes with zero counts in all
## samples (logCPM == -Inf), because these cause squeezeVar to throw
## errors. Instead, it just arbitrarily sets their F statistics to
## zero and P-Values to 1, which is appropriate because obviously we
## cannot call differential expression if all counts are zero.
glmQLFTest.safe <- function (glmfit, coef = ncol(glmfit$design),
contrast = NULL,
abundance.trend = TRUE)
{
present <- is.finite(glmfit$abundance) & is.finite(glmfit$dispersion)
out <- glmLRT(glmfit, coef = coef, contrast = contrast)
df.residual <- glmfit$df.residual
s2 <- glmfit$deviance/df.residual
df.residual[s2 < 1e-14] <- 0
s2 <- pmax(s2, 0)
if (abundance.trend) {
s2.fit <- squeezeVar(s2[present],
df = df.residual[present],
covariate = glmfit$abundance[present])
}
else {
s2.fit <- squeezeVar(s2, df = df.residual)
}
F <- rep(0, length(s2))
F[present] <- (out$table$LR/out$df.test)[present]/s2.fit$var.post
df.total <- s2.fit$df.prior + df.residual
max.df.residual <- ncol(glmfit$counts) - ncol(glmfit$design)
df.total <- min(df.total, length(s2) * max.df.residual)
F.pvalue <- pf(F, df1 = out$df.test, df2 = df.total, lower.tail = FALSE,
log.p = FALSE)
i <- s2.fit$var.post < 1
if (any(i)) {
chisq.pvalue <- pchisq(out$table$LR[i], df = out$df.test[i],
lower.tail = FALSE, log.p = FALSE)
F.pvalue[i] <- pmax(F.pvalue[i], chisq.pvalue)
}
out$table$LR <- out$table$PValue <- NULL
out$table$F <- F
out$table$PValue <- F.pvalue
out$s2.fit <- s2.fit
out$df.prior <- s2.fit$df.prior
out$df.total <- df.total
out
}
Secondly, I wanted to use edgeR's glmLRT and topTags function to
generate nice tables for my DESeq results, so I wrote a few functions
that allow me to use edgeR's glmFit function on a DESeq CountDataSet to
yield a DGEGLM object just as if glmFit was called on a DGEList object.
I have verified that these functions produce nearly identical fit
coefficients to DESeq's own fitNbinomGLMs, and the resulting p-values
from glmLRT are nearly identical to DESeq's nbinomGLMTest, so this
accurately reproduces the DESeq method using edgeR functions. This
allows you to use edgeR's topTags to get a nice table with PValues, FDR,
logFC, logCPM, and annotations. I also tried glmQLFTest on the results,
but it didn't change any of the p-values because the F-test derived
p-values were all smaller so it just kept the chi-squared-derived
p-values, so I'm not sure if glmQLFTest is valid with DESeq. Anyway,
here are the functions:
## The following two functions allow the use to edgeR's infrastructure
## to execute the DESeq method. Instead of glmFit(dge, design), use
## glmFit.CountDataSet(cds, design), then continue as with a normal
## edgeR analysis.
getOffset.CountDataSet <- function(y) {
if (any(is.na(sizeFactors(y))))
stop("Call estimateDispersions first")
log(sizeFactors(y)) - mean(log(sizeFactors(y) / colSums(counts(y))))
}
glmFit.CountDataSet <- function (y, design = NULL, dispersion = NULL,
offset = NULL,
weights = NULL, lib.size = NULL, start
= NULL, method = "auto",
...)
{
stopifnot(is(y, "CountDataSet"))
if (is.null(dispersion)) {
if ("disp_pooled" %in% colnames(fData(y)))
dispersion <- fData(y)$disp_pooled
else if ("disp_blind" %in% colnames(fData(y))) {
if (fitInfo(y, "blind")$sharingMode != "fit-only")
warning("You have used 'method=\"blind\"' in estimateDispersion
without also setting 'sharingMode=\"fit-only\"'. This will not yield
useful results.")
dispersion <- fData(y)$disp_blind
}
else stop("Call 'estimateDispersions' with 'method=\"pooled\"' (or
'blind') first.")
}
if (is.null(offset) && is.null(lib.size))
offset <- getOffset.CountDataSet(y)
## UGLY HACK: DESeq can produce infinite dispersion estimates, which
## cause errors in glmFit. To fix this, we replace infinite
## dispersion estimates with the maximum representable floating
## point value, which should always result in a PValue of 1.
infdisp <- !is.finite(dispersion)
dispersion[infdisp] <- .Machine$double.xmax
fit <- glmFit.default(y = counts(y), design = design, dispersion =
dispersion,
offset = offset, weights = weights, lib.size =
lib.size,
start = start, method = method, ...)
## Now set the dispersions back to infinite in the resulting fit object.
fit$dispersion[infdisp] <- +Inf
dimnames(fit$coefficients) <- list(rownames(counts(y)), colnames(design))
fit$counts <- counts(y)
fit$samples <- pData(y)[-1]
fit$genes <- fData(y)[setdiff(names(fData(y)), c("disp_blind",
"disp_pooled"))]
new("DGEGLM", fit)
}
Note that while I have named the functions like S3 method, I haven't
actually tested whether method dispatch works properly with them or not,
since DESeq uses S4 classes. In my code, I call them by their full names
instead of relying on method dispatch.
Please let me know if these seem useful to anyone. I've found them
useful, because now I can produce edgeR and DESeq results in exactly the
same format (via topTags).
Regards,
-Ryan Thompson
More information about the Bioc-devel
mailing list