[Bioc-devel] Utility functions and bug fixes for edgeR/DESeq
Gordon K Smyth
smyth at wehi.EDU.AU
Tue Nov 27 06:35:20 CET 2012
Dear Ryan,
I have now committed a change so that glmQLFTest() handles all zero cases
gracefully. It make take a few days to percolate through to the official
release on Bioconductor. Note that the change is to limma, not to edgeR,
and the new version is limma 3.14.2.
Best wishes
Gordon
---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.statsci.org/smyth
On Wed, 21 Nov 2012, Gordon K Smyth wrote:
> Dear Ryan,
>
> Thank you for raising the issue with all zero counts and glmQLFTest(). I am
> not sure why you are running the code with all zero cases, since we advise
> against it, but I agree that it would be better for the code to handle such
> cases more gracefully.
>
> Rather than posting your own fixes to bioc-devel, the best way to deal with a
> bug in a Bioconductor package is to write to the package authors and get it
> fixed in the package itself. Most authors are grateful to be notified of
> problems and will fix them asap. I have certainly responded to you when you
> have written to me. Indeed I was the one who suggested that you try
> glmQLFTest in the first place. We invite users to send bug reports to us in
> Section 1.3 "how to get help" of the edgeR User's Guide.
>
> On reason why creating a parallel universe of hacked functions is not a good
> idea is that packages change. As I told you a month ago, the glmQLFTest is
> under rapid development, and our non-public code has moved on in many ways
> since the version that you have hacked. In particular, we fixed the issue
> with all zero counts a while back. I admit that we haven't yet committed our
> new versions to the Bioconductor repository. There are so many changes in the
> package that we wanted to completely test and debug everything before making
> them public.
>
> Anyway, I will now give priority to commiting a correction to the official
> Bioconductor release.
>
> Best wishes
> Gordon
>
>> Date: Mon, 19 Nov 2012 18:08:04 -0800
>> From: "Ryan C. Thompson" <rct at thompsonclan.org>
>> To: bioc-devel at r-project.org
>> Subject: [Bioc-devel] Utility functions and bug fixes for edgeR/DESeq
>>
>> In my work, I've developed a few useful functions relating to edgeR and
>> DESeq. First, I wrote a version of glmQLFTest that does not throw errors
>> on zero-count genes. See the comment for details on what exactly it does:
>>
>> ## This version of glmQLFTest excludes genes with zero counts in all
>> ## samples (logCPM == -Inf), because these cause squeezeVar to throw
>> ## errors. Instead, it just arbitrarily sets their F statistics to
>> ## zero and P-Values to 1, which is appropriate because obviously we
>> ## cannot call differential expression if all counts are zero.
>> glmQLFTest.safe <- function (glmfit, coef = ncol(glmfit$design),
>> contrast = NULL,
>> abundance.trend = TRUE)
>> {
>> present <- is.finite(glmfit$abundance) & is.finite(glmfit$dispersion)
>> out <- glmLRT(glmfit, coef = coef, contrast = contrast)
>> df.residual <- glmfit$df.residual
>> s2 <- glmfit$deviance/df.residual
>> df.residual[s2 < 1e-14] <- 0
>> s2 <- pmax(s2, 0)
>> if (abundance.trend) {
>> s2.fit <- squeezeVar(s2[present],
>> df = df.residual[present],
>> covariate = glmfit$abundance[present])
>> }
>> else {
>> s2.fit <- squeezeVar(s2, df = df.residual)
>> }
>> F <- rep(0, length(s2))
>> F[present] <- (out$table$LR/out$df.test)[present]/s2.fit$var.post
>> df.total <- s2.fit$df.prior + df.residual
>> max.df.residual <- ncol(glmfit$counts) - ncol(glmfit$design)
>> df.total <- min(df.total, length(s2) * max.df.residual)
>> F.pvalue <- pf(F, df1 = out$df.test, df2 = df.total, lower.tail = FALSE,
>> log.p = FALSE)
>> i <- s2.fit$var.post < 1
>> if (any(i)) {
>> chisq.pvalue <- pchisq(out$table$LR[i], df = out$df.test[i],
>> lower.tail = FALSE, log.p = FALSE)
>> F.pvalue[i] <- pmax(F.pvalue[i], chisq.pvalue)
>> }
>> out$table$LR <- out$table$PValue <- NULL
>> out$table$F <- F
>> out$table$PValue <- F.pvalue
>> out$s2.fit <- s2.fit
>> out$df.prior <- s2.fit$df.prior
>> out$df.total <- df.total
>> out
>> }
>>
>>
>>
>>
>> Secondly, I wanted to use edgeR's glmLRT and topTags function to
>> generate nice tables for my DESeq results, so I wrote a few functions
>> that allow me to use edgeR's glmFit function on a DESeq CountDataSet to
>> yield a DGEGLM object just as if glmFit was called on a DGEList object.
>> I have verified that these functions produce nearly identical fit
>> coefficients to DESeq's own fitNbinomGLMs, and the resulting p-values
>> from glmLRT are nearly identical to DESeq's nbinomGLMTest, so this
>> accurately reproduces the DESeq method using edgeR functions. This
>> allows you to use edgeR's topTags to get a nice table with PValues, FDR,
>> logFC, logCPM, and annotations. I also tried glmQLFTest on the results,
>> but it didn't change any of the p-values because the F-test derived
>> p-values were all smaller so it just kept the chi-squared-derived
>> p-values, so I'm not sure if glmQLFTest is valid with DESeq. Anyway,
>> here are the functions:
>>
>> ## The following two functions allow the use to edgeR's infrastructure
>> ## to execute the DESeq method. Instead of glmFit(dge, design), use
>> ## glmFit.CountDataSet(cds, design), then continue as with a normal
>> ## edgeR analysis.
>> getOffset.CountDataSet <- function(y) {
>> if (any(is.na(sizeFactors(y))))
>> stop("Call estimateDispersions first")
>> log(sizeFactors(y)) - mean(log(sizeFactors(y) / colSums(counts(y))))
>> }
>>
>> glmFit.CountDataSet <- function (y, design = NULL, dispersion = NULL,
>> offset = NULL,
>> weights = NULL, lib.size = NULL, start
>> = NULL, method = "auto",
>> ...)
>> {
>> stopifnot(is(y, "CountDataSet"))
>> if (is.null(dispersion)) {
>> if ("disp_pooled" %in% colnames(fData(y)))
>> dispersion <- fData(y)$disp_pooled
>> else if ("disp_blind" %in% colnames(fData(y))) {
>> if (fitInfo(y, "blind")$sharingMode != "fit-only")
>> warning("You have used 'method=\"blind\"' in estimateDispersion
>> without also setting 'sharingMode=\"fit-only\"'. This will not yield
>> useful results.")
>> dispersion <- fData(y)$disp_blind
>> }
>> else stop("Call 'estimateDispersions' with 'method=\"pooled\"' (or
>> 'blind') first.")
>> }
>> if (is.null(offset) && is.null(lib.size))
>> offset <- getOffset.CountDataSet(y)
>> ## UGLY HACK: DESeq can produce infinite dispersion estimates, which
>> ## cause errors in glmFit. To fix this, we replace infinite
>> ## dispersion estimates with the maximum representable floating
>> ## point value, which should always result in a PValue of 1.
>> infdisp <- !is.finite(dispersion)
>> dispersion[infdisp] <- .Machine$double.xmax
>> fit <- glmFit.default(y = counts(y), design = design, dispersion =
>> dispersion,
>> offset = offset, weights = weights, lib.size =
>> lib.size,
>> start = start, method = method, ...)
>> ## Now set the dispersions back to infinite in the resulting fit object.
>> fit$dispersion[infdisp] <- +Inf
>> dimnames(fit$coefficients) <- list(rownames(counts(y)), colnames(design))
>> fit$counts <- counts(y)
>> fit$samples <- pData(y)[-1]
>> fit$genes <- fData(y)[setdiff(names(fData(y)), c("disp_blind",
>> "disp_pooled"))]
>> new("DGEGLM", fit)
>> }
>>
>> Note that while I have named the functions like S3 method, I haven't
>> actually tested whether method dispatch works properly with them or not,
>> since DESeq uses S4 classes. In my code, I call them by their full names
>> instead of relying on method dispatch.
>>
>> Please let me know if these seem useful to anyone. I've found them
>> useful, because now I can produce edgeR and DESeq results in exactly the
>> same format (via topTags).
>>
>> Regards,
>>
>> -Ryan Thompson
>>
>
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioc-devel
mailing list