[BioC] [ComBat] ComBat outputting NAs with Large Batches - Switch to Log Likelihood
Jason Stein
steinja at ucla.edu
Wed Nov 21 20:00:33 CET 2012
Hi fellow ComBat fans,
I'm not sure if my previous message posted to the listserv because of
the attachment, so I'm re-sending it without the attachment:
I was having a problem with ComBat when using large batch sizes and
the non-parametric parameter estimation. Almost all probe intensity
values after ComBat correction would be NAs.
I've traced the source of the problem and describe a solution below
(switching from the likelihood to the log-likelihood). I hope you find
it helpful, and please let me know if you spot any errors. I can also
send the attachment, which shows why I changed what I changed, if
anyone would like to take a look.
ComBat correction
ComBat was sometimes outputting NAs for all probe values when using
non-parametric parameter estimates. I traced this to the likelihood
calculation in the int.eprior function. The original function is
pasted below:
# Monte Carlo integration function to find the nonparametric
# adjustments for a particular batch
int.eprior <- function(sdat, g.hat, d.hat){
    g.star <- d.star <- NULL
    r <- nrow(sdat)
    for(i in 1:r){
        g <- g.hat[-i]
        d <- d.hat[-i]
        x <- sdat[i, !is.na(sdat[i,])]
        n <- length(x)
        j <- numeric(n) + 1
        dat <- matrix(as.numeric(x), length(g), n, byrow=TRUE)
        resid2 <- (dat - g)^2
        sum2 <- resid2 %*% j
        LH <- 1/(2*pi*d)^(n/2) * exp(-sum2/(2*d))
        LH[LH == "NaN"] <- 0
        g.star <- c(g.star, sum(g*LH)/sum(LH))
        d.star <- c(d.star, sum(d*LH)/sum(LH))
        ## if(i %% 1000 == 0){cat(i, '\n')}
    }
    adjust <- rbind(g.star, d.star)
    rownames(adjust) <- c("g.star", "d.star")
    adjust
}
The likelihood calculation LH here underflows to zero when sum2 is
very large, because exp() of a large negative number falls below the
smallest representable double. This causes the parameter estimates
g.star and d.star to become NaN from a zero-divided-by-zero error, and
consequently the output of the program to be all NAs. This can happen
when there are many samples in a batch or when there are large
residuals (or both). To correct this, we can use the log-likelihood,
which is far less susceptible to numerical precision errors because
its magnitude stays well within the representable range.
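The underflow is easy to reproduce at the R prompt (the numbers here
are made up for illustration, not taken from real probe data):

n <- 100; d <- 1; sum2 <- 2000
LH <- 1/(2*pi*d)^(n/2) * exp(-sum2/(2*d))
LH                    # 0: exp(-1000) is below .Machine$double.xmin
sum(1 * LH)/sum(LH)   # NaN from 0/0, which propagates to the output as NAs
# In log space the same quantity is a harmless finite number:
LL <- -(n/2)*log(2*pi*d) - sum2/(2*d)
LL                    # about -1091.9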
I changed the function to use the log-likelihood; the entire edited
function is pasted below. The key trick is in the parameter estimates:
rather than exponentiating each log-likelihood directly, every weight
is divided by the largest one (equivalently, the maximum
log-likelihood is subtracted before exponentiating), so the
exponentials all stay between 0 and 1:
# Monte Carlo integration function to find the nonparametric
# adjustments for a particular batch
int.eprior <- function(sdat, g.hat, d.hat){
    # Get the number of probes
    r <- nrow(sdat)
    # The number of probes after removing the probe of interest
    G <- r - 1
    # g.star and d.star hold the output, one entry per probe
    g.star <- d.star <- rep(NA, r)
    # Loop over each probe
    for(i in 1:r){
        # Remove the beta value of the probe of interest
        g <- g.hat[-i]
        # Remove the variance of the probe of interest
        d <- d.hat[-i]
        # Isolate the values of this probe across all samples
        x <- sdat[i, !is.na(sdat[i,])]
        # Number of samples in this batch
        n <- length(x)
        # A matrix of dimension G x n which contains the probe values
        # repeated in each row
        dat <- matrix(as.numeric(x), length(g), n, byrow=TRUE)
        # The squared residuals between the probe values and the
        # regression coefficients
        resid2 <- (dat - g)^2
        # Calculate the Gaussian log-likelihood
        LL <- rep(0, G)
        for (j in 1:n) {
            LL <- LL - 0.5*log(2*pi*d) - resid2[,j]/(2*d)
        }
        # Sort so the largest log-likelihood comes first; dividing every
        # weight by the largest keeps the exponentials <= 1 and avoids
        # Inf in the parameter estimates
        orderind <- order(LL, decreasing=TRUE)
        LL <- LL[orderind]
        g <- g[orderind]
        d <- d[orderind]
        # Determine the parameter estimates
        numerator.g <- 1
        denominator <- 1
        numerator.d <- 1
        for (k in 2:G) {
            numerator.g <- numerator.g + (g[k]/g[1]) * exp(LL[k] - LL[1])
            denominator <- denominator + exp(LL[k] - LL[1])
            numerator.d <- numerator.d + (d[k]/d[1]) * exp(LL[k] - LL[1])
        }
        g.star[i] <- g[1] * (numerator.g/denominator)
        d.star[i] <- d[1] * (numerator.d/denominator)
    }
    adjust <- rbind(g.star, d.star)
    rownames(adjust) <- c("g.star", "d.star")
    adjust
}
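As a quick sanity check (a toy example of my own, not part of ComBat),
the divide-by-the-largest-weight trick gives exactly the same answer
as the direct weighted average whenever the direct version does not
underflow:

set.seed(1)
g <- rnorm(5); LL <- rnorm(5)
# Direct weighted average; safe here because LL is small
direct <- sum(g * exp(LL)) / sum(exp(LL))
# Stable version: sort, then divide every weight by the largest
o <- order(LL, decreasing=TRUE)
LL <- LL[o]; g <- g[o]
w <- exp(LL - LL[1])                  # all weights are now <= 1
stable <- g[1] * sum((g/g[1]) * w) / sum(w)
all.equal(direct, stable)             # TRUE

Note that the k = 1 term of sum((g/g[1]) * w) is exactly 1, which is
why numerator.g and denominator start at 1 in the function above.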