[Bioc-devel] Utility functions and bug fixes for edgeR/DESeq

Wed Nov 21 01:25:28 CET 2012

Dear Ryan,

Thank you for raising the issue with all zero counts and glmQLFTest().  I 
am not sure why you are running the code with all zero cases, since we 
advise against it, but I agree that it would be better for the code to 
handle such cases more gracefully.

Rather than posting your own fixes to bioc-devel, the best way to deal 
with a bug in a Bioconductor package is to write to the package authors 
and get it fixed in the package itself.  Most authors are grateful to be 
notified of problems and will fix them asap.  I have certainly responded 
to you when you have written to me.  Indeed I was the one who suggested 
that you try glmQLFTest in the first place.  We invite users to send bug 
reports to us in Section 1.3 "how to get help" of the edgeR User's Guide.

On reason why creating a parallel universe of hacked functions is not a 
good idea is that packages change.  As I told you a month ago, the 
glmQLFTest is under rapid development, and our non-public code has moved 
on in many ways since the version that you have hacked.  In particular, we 
fixed the issue with all zero counts a while back.  I admit that we 
haven't yet committed our new versions to the Bioconductor repository. 
There are so many changes in the package that we wanted to completely test 
and debug everything before making them public.

Anyway, I will now give priority to commiting a correction to the official 
Bioconductor release.

Best wishes
Gordon

> Date: Mon, 19 Nov 2012 18:08:04 -0800
> From: "Ryan C. Thompson" <rct at thompsonclan.org>
> To: bioc-devel at r-project.org
> Subject: [Bioc-devel] Utility functions and bug fixes for edgeR/DESeq
>
> In my work, I've developed a few useful functions relating to edgeR and
> DESeq. First, I wrote a version of glmQLFTest that does not throw errors
> on zero-count genes. See the comment for details on what exactly it does:
>
> ## This version of glmQLFTest excludes genes with zero counts in all
> ## samples (logCPM == -Inf), because these cause squeezeVar to throw
> ## errors. Instead, it just arbitrarily sets their F statistics to
> ## zero and P-Values to 1, which is appropriate because obviously we
> ## cannot call differential expression if all counts are zero.
> glmQLFTest.safe <- function (glmfit, coef = ncol(glmfit$design),
> contrast = NULL,
>                              abundance.trend = TRUE)
> {
>   present <- is.finite(glmfit$abundance) & is.finite(glmfit$dispersion)
>   out <- glmLRT(glmfit, coef = coef, contrast = contrast)
>   df.residual <- glmfit$df.residual
>   s2 <- glmfit$deviance/df.residual
>   df.residual[s2 < 1e-14] <- 0
>   s2 <- pmax(s2, 0)
>   if (abundance.trend)  {
>     s2.fit <- squeezeVar(s2[present],
>                          df = df.residual[present],
>                          covariate = glmfit$abundance[present])
>   }
>   else {
>     s2.fit <- squeezeVar(s2, df = df.residual)
>   }
>   F <- rep(0, length(s2))
>   F[present] <- (out$table$LR/out$df.test)[present]/s2.fit$var.post
>   df.total <- s2.fit$df.prior + df.residual
>   max.df.residual <- ncol(glmfit$counts) - ncol(glmfit$design)
>   df.total <- min(df.total, length(s2) * max.df.residual)
>   F.pvalue <- pf(F, df1 = out$df.test, df2 = df.total, lower.tail = FALSE,
>                  log.p = FALSE)
>   i <- s2.fit$var.post < 1
>   if (any(i)) {
>     chisq.pvalue <- pchisq(out$table$LR[i], df = out$df.test[i],
>                            lower.tail = FALSE, log.p = FALSE)
>     F.pvalue[i] <- pmax(F.pvalue[i], chisq.pvalue)
>   }
>   out$table$LR <- out$table$PValue <- NULL
>   out$table$F <- F
>   out$table$PValue <- F.pvalue
>   out$s2.fit <- s2.fit
>   out$df.prior <- s2.fit$df.prior
>   out$df.total <- df.total
>   out
> }
>
>
>
>
> Secondly, I wanted to use edgeR's glmLRT and topTags function to
> generate nice tables for my DESeq results, so I wrote a few functions
> that allow me to use edgeR's glmFit function on a DESeq CountDataSet to
> yield a DGEGLM object just as if glmFit was called on a DGEList object.
> I have verified that these functions produce nearly identical fit
> coefficients to DESeq's own fitNbinomGLMs, and the resulting p-values
> from glmLRT are nearly identical to DESeq's nbinomGLMTest, so this
> accurately reproduces the DESeq method using edgeR functions. This
> allows you to use edgeR's topTags to get a nice table with PValues, FDR,
> logFC, logCPM, and annotations. I also tried glmQLFTest on the results,
> but it didn't change any of the p-values because the F-test derived
> p-values were all smaller so it just kept the chi-squared-derived
> p-values, so I'm not sure if glmQLFTest is valid with DESeq. Anyway,
> here are the functions:
>
> ## The following two functions allow the use to edgeR's infrastructure
> ## to execute the DESeq method. Instead of glmFit(dge, design), use
> ## glmFit.CountDataSet(cds, design), then continue as with a normal
> ## edgeR analysis.
> getOffset.CountDataSet <- function(y) {
>   if (any(is.na(sizeFactors(y))))
>     stop("Call estimateDispersions first")
>   log(sizeFactors(y)) - mean(log(sizeFactors(y) / colSums(counts(y))))
> }
>
> glmFit.CountDataSet <- function (y, design = NULL, dispersion = NULL,
> offset = NULL,
>                                  weights = NULL, lib.size = NULL, start
> = NULL, method = "auto",
>                                  ...)
> {
>   stopifnot(is(y, "CountDataSet"))
>   if (is.null(dispersion)) {
>     if ("disp_pooled" %in% colnames(fData(y)))
>       dispersion <- fData(y)$disp_pooled
>     else if ("disp_blind" %in% colnames(fData(y))) {
>       if (fitInfo(y, "blind")$sharingMode != "fit-only")
>         warning("You have used 'method=\"blind\"' in estimateDispersion
> without also setting 'sharingMode=\"fit-only\"'. This will not yield
> useful results.")
>       dispersion <- fData(y)$disp_blind
>     }
>     else stop("Call 'estimateDispersions' with 'method=\"pooled\"' (or
> 'blind') first.")
>   }
>   if (is.null(offset) && is.null(lib.size))
>     offset <- getOffset.CountDataSet(y)
>   ## UGLY HACK: DESeq can produce infinite dispersion estimates, which
>   ## cause errors in glmFit. To fix this, we replace infinite
>   ## dispersion estimates with the maximum representable floating
>   ## point value, which should always result in a PValue of 1.
>   infdisp <- !is.finite(dispersion)
>   dispersion[infdisp] <- .Machine$double.xmax
>   fit <- glmFit.default(y = counts(y), design = design, dispersion =
> dispersion,
>                         offset = offset, weights = weights, lib.size =
> lib.size,
>                         start = start, method = method, ...)
>   ## Now set the dispersions back to infinite in the resulting fit object.
>   fit$dispersion[infdisp] <- +Inf
>   dimnames(fit$coefficients) <- list(rownames(counts(y)), colnames(design))
>   fit$counts <- counts(y)
>   fit$samples <- pData(y)[-1]
>   fit$genes <- fData(y)[setdiff(names(fData(y)), c("disp_blind",
> "disp_pooled"))]
>   new("DGEGLM", fit)
> }
>
> Note that while I have named the functions like S3 method, I haven't
> actually tested whether method dispatch works properly with them or not,
> since DESeq uses S4 classes. In my code, I call them by their full names
> instead of relying on method dispatch.
>
> Please let me know if these seem useful to anyone. I've found them
> useful, because now I can produce edgeR and DESeq results in exactly the
> same format (via topTags).
>
> Regards,
>
> -Ryan Thompson
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}