[BioC] DESeq: GLM for multi-factor and mulit-level designs
Ryan C. Thompson
rct at thompsonclan.org
Mon Nov 26 22:15:28 CET 2012
If you are more familiar with edgeR, I have written a version of edgeR's
glmFit that takes a DESeq CountDataSet and returns an edgeR DGEGLM
object. Then you can use edgeR's GLM functionality, which directly
supports the kind of comparisions you are seeking.
I have tested these functions and found that they produce identical
results to DESeq's GLM methods when running the same test on the same
CountDataSet object, so I believe that they will also produce valid
results for tests that DESeq's methods do not directly support.
If people find these useful, I'd like to get them included in
Bioconductor, but I'm not sure which package they belong in. Suggestions
welcome.
## The following two functions allow the use of edgeR's infrastructure
## to execute the DESeq method. Instead of glmFit(dge, design), use
## glmFit.CountDataSet(cds, design), then continue as with a normal
## edgeR analysis.
getOffset.CountDataSet <- function(y) {
if (any(is.na(sizeFactors(y))))
stop("Call estimateDispersions first")
log(sizeFactors(y)) - mean(log(sizeFactors(y) / colSums(counts(y))))
}
glmFit.CountDataSet <- function (y, design = NULL, dispersion = NULL,
offset = NULL,
weights = NULL, lib.size = NULL, start
= NULL, method = "auto",
...)
{
stopifnot(is(y, "CountDataSet"))
if (is.null(dispersion)) {
if ("disp_pooled" %in% colnames(fData(y)))
dispersion <- fData(y)$disp_pooled
else if ("disp_blind" %in% colnames(fData(y))) {
if (fitInfo(y, "blind")$sharingMode != "fit-only")
warning("You have used 'method=\"blind\"' in estimateDispersion
without also setting 'sharingMode=\"fit-only\"'. This will not yield
useful results.")
dispersion <- fData(y)$disp_blind
}
else stop("Call 'estimateDispersions' with 'method=\"pooled\"' (or
'blind') first.")
}
if (is.null(offset) && is.null(lib.size))
offset <- getOffset.CountDataSet(y)
## UGLY HACK: DESeq can produce infinite dispersion estimates, which
## cause errors in glmFit. To fix this, we replace infinite
## dispersion estimates with the maximum representable floating
## point value, which should always result in a PValue of 1.
infdisp <- !is.finite(dispersion)
dispersion[infdisp] <- .Machine$double.xmax
fit <- glmFit.default(y = counts(y), design = design, dispersion =
dispersion,
offset = offset, weights = weights, lib.size =
lib.size,
start = start, method = method, ...)
## Now set the dispersions back to infinite in the resulting fit object.
fit$dispersion[infdisp] <- +Inf
dimnames(fit$coefficients) <- list(rownames(counts(y)), colnames(design))
fit$counts <- counts(y)
fit$samples <- pData(y)[-1]
fit$genes <- fData(y)[setdiff(names(fData(y)), c("disp_blind",
"disp_pooled"))]
new("DGEGLM", fit)
}
On 11/26/2012 08:34 AM, Dorota Herman wrote:
> Hello everyone,
>
> I am currently working on GLM for RNA-seq data in DESeq for a design
> with two factors and three factor levels, such as:
>
> type experiments
> A_1 A exper1
> A_2 A exper2
> A_3 A exper3
> B_1 B exper2
> B_2 B exper3
> B_3 B exper4
> C_1 C exper4
> C_2 C exper2
> C_3 C exper1
>
> In the recent example of the GLM application there is a fixme note:
>
> “Can we add a paragraph on what to do if we only want to make specific
> comparisons, e.g. let fac be a factor with three levels A, B and C,
> and we want to test pairwise for differences between A and B (without
> C), between A and C (without B) etc.” .... and between B and C
> (without A)
>
> Is it possible to conduct such a analysis with the current DESeq version?
>
> Thanks a lot
> Dorota
>
More information about the Bioconductor
mailing list