[BioC] Back-estimating batch variables from SVA for ComBat?

Tue Aug 7 18:36:00 CEST 2012

Thanks, Andrew! We'll give this a shot; just wasn't sure if there was some way to apply ComBat directly or if you had to do something more along these lines. 

Hollis Wright, PhD
Ojeda Lab, Division of Neuroscience
Oregon Health and Science University

________________________________________
From: Andrew Jaffe [ajaffe at jhsph.edu]
Sent: Tuesday, August 07, 2012 7:15 AM
To: bioconductor at r-project.org
Cc: Hollis Wright
Subject: [BioC] Back-estimating batch variables from SVA for ComBat?

Hollis,

There's no need to "estimate" the batch inputs for ComBat after running SVA - you can just adjust for the surrogate variables themselves in any downstream analyses. For something like WGCNA, or any other clustering algorithm, you can regress the surrogate variables out of the expression data (see code below). In order to minimize the loss of biological signal, be sure to properly specify your model matrix ('mod') prior to running SVA - any variable you put in the model will be "protected" from being treated as unexplained heterogeneity that SVA estimates. Specifying the model matrix might be a little tricky given the potential longitudinal nature of the data, but you can be fairly flexible with the linear model that you input into SVA. You could also try manually inputting the estimation of fewer surrogate variables if you want to be conservative about removing biology, at the expense of possibly preserving some unexplained heterogeneity using the "n.sv<http://n.sv>" argument in SVA

When you regress the surrogate variables out of the expression data, be sure to fit the entire model including your covariates of interest (and not just the surrogate variables) - otherwise, the cleaned data will not look as good. You can use the following function:

cleaningY = function(y, mod, svaobj) {
X=cbind(mod,svaobj$sv)
Hat=solve(t(X)%*%X)%*%t(X)
beta=(Hat%*%t(y))
P=ncol(mod)
cleany=y-t(as.matrix(X[,-c(1:P)])%*%beta[-c(1:P),])
return(cleany)
}
# and implement it like this:
mod = model.matrix(~[whatever your model is]) # specify the model
svaobj = sva(y, mod) # y is your expression matrix
cleany = cleaningY(y,mod,svaobj)
WGCNA(cleany,...) # or whatever the format is...

Hope that helps,
Andrew

Message: 9
Date: Mon, 6 Aug 2012 14:50:03 -0700
From: Hollis Wright <wrighth at ohsu.edu<mailto:wrighth at ohsu.edu>>
To: "bioconductor at r-project.org<mailto:bioconductor at r-project.org>" <bioconductor at r-project.org<mailto:bioconductor at r-project.org>>
Subject: [BioC] Back-estimating batch variables from SVA for ComBat?
Message-ID:
        <CFC8814F553202438E9A18DC1E512E2806F93ED0A0 at EX-MB04.ohsu.edu<mailto:CFC8814F553202438E9A18DC1E512E2806F93ED0A0 at EX-MB04.ohsu.edu>>
Content-Type: text/plain; charset="us-ascii"

Hi, all; we're working with some gene expression data and we suspect that there may be some irregularities in arrays; unfortunately, these arrays were run some time ago and we're not actually sure what the batches (if any) they were run in were at this point; complicating matters, this is also a timepoint analysis so there's some technical variability there most likely. Long-run, we'd like to run WGCNA and some similar methods on this data and I'd had good luck with ComBat adjusting for batches in a previous case where we did have that information. Is there any way to use sva estimates to estimate what the batches "should" be for our data? sva does estimate and find three significant latent variables but I'm not sure what (if anything) can be done from there in terms of adjusting the expression levels to compensate for the latent variables; obviously we'd also be concerned about losing the biological variability if we make any adjustments. Is this possible, or am I heading i!
 n the wrong direction here?

Hollis Wright, PhD
Ojeda Lab, Division of Neuroscience
Oregon Health and Science University