[BioC] [ComBat] ComBat outputting NAs with Large Batches - Switch to Log Likelihood

Mon Nov 26 15:29:11 CET 2012

Hi Jason, 

Thanks for this email. We will evaluate your solution and implement it into future versions of ComBat if it truly does the trick!

Thanks!

Evan

On Nov 22, 2012, at 6:00 AM, bioconductor-request at r-project.org wrote:

> Date: Wed, 21 Nov 2012 11:00:33 -0800
> From: Jason Stein <steinja at ucla.edu>
> To: bioconductor at r-project.org
> Subject: [BioC] [ComBat] ComBat outputting NAs with Large Batches -
> 	Switch to Log Likelihood
> Message-ID: <20121121110033.256149mixyg20d75 at mail.ucla.edu>
> Content-Type: text/plain; charset=ISO-8859-1; DelSp="Yes";
> 	format="flowed"
> 
> Hi fellow ComBat fans,
> 
> Not sure if my previous message posted to the listserv because of the  
> attachment.  Now re-sending without attachment:
> 
> I was having a problem with ComBat when using large batch sizes and  
> the non-parametric parameter estimation.  Almost all probe intensity  
> values after ComBat correction would be NAs.
> 
> I've traced the source of this problem and have a solution (switching  
> from likelihood to log likelihood) described below.  I hope you all  
> find it helpful, or that you would let me know if you find any errors.  
>  Also, I can send the attachment which shows why I changed what I  
> changed if anyone would like to take a look.
> 
> ComBat correction
> 
> ComBat was sometimes outputting NAs for all probe values using  
> non-parametric parameter estimates.  I traced this to the likelihood  
> calculation in the int.eprior function.  The original function is  
> pasted below:
> 
> # Monte Carlo integration function to find the nonparametric  
> adjustments for a particular batch
> int.eprior <- function(sdat,g.hat,d.hat){
>   g.star <- d.star <- NULL
>   r <- nrow(sdat)
>   for(i in 1:r){
>     g <- g.hat[-i]
>     d <- d.hat[-i]
>     x <- sdat[i,!is.na(sdat[i,])]
>     n <- length(x)
>     j <- numeric(n)+1
>     dat <- matrix(as.numeric(x),length(g),n,byrow=T)
>     resid2 <- (dat-g)^2
>     sum2 <- resid2%*%j
>     LH <- 1/(2*pi*d)^(n/2)*exp(-sum2/(2*d))
>     LH[LH=="NaN"]=0
>     g.star <- c(g.star,sum(g*LH)/sum(LH))
>     d.star <- c(d.star,sum(d*LH)/sum(LH))
>     ##if(i%%1000==0){cat(i,'\n')}
>   }
>   adjust <- rbind(g.star,d.star)
>   rownames(adjust) <- c("g.star","d.star")
>   adjust
> }
> 
> The likelihood calculation LH here goes to zero when sum2 is very  
> large due to numerical precision errors.  This causes the parameter  
> estimates g.star and d.star to become NaN from a divide by zero error,  
> and consequently for the output of the program to be NAs. This can  
> happen when there are a lot of samples in a batch or when there are  
> large residuals (or both).  To correct this, can use the  
> log-likelihood, which is less susceptible to numerical precision  
> errors since it is not so small.
> 
> I changed the function to use log-likelihood and the entire edited  
> function is now pasted below:
> 
> # Monte Carlo integration function to find the nonparametric  
> adjustments for a particular batch
> int.eprior <- function(sdat,g.hat,d.hat){
>   #Get the number of probes
>   r <- nrow(sdat);
>   #The number of probes removing the probe of interest
>   G = r - 1;
>   #g.star and d.star are the output matrices
>   g.star <- d.star <- matrix(NA,nrow=G,ncol=1);
>   #Loop over each probe
>   for(i in 1:r){
>     #Remove the beta value of the probe of interest
>     g <- g.hat[-i];
>     #Remove the variance of the probe of interest
>     d <- d.hat[-i];
>     #Isolate the value of this probe across all samples
>     x <- sdat[i,!is.na(sdat[i,])];
>     #Number of samples in this batch
>     n <- length(x)
>     #A matrix of dimension gxn which contains the probe values  
> repeated in each row
>     dat <- matrix(as.numeric(x),length(g),n,byrow=T)
>     #The residual between the probe values and the regression coefficient
>     resid2 <- (dat-g)^2
>     #Calculate Log-likelihood using Gaussian
>     LL = matrix(0,nrow=length(G),ncol=1);
>     for (j in 1:n) {
>       LL = LL + -0.5*log(2*pi*d)-(resid2[,j]/(2*d));
>     }
>     #Order by the log-likelihood values to avoid Inf in parameter estimates
>     orderind = order(LL,decreasing=TRUE);
>     LL = LL[orderind];
>     g = g[orderind];
>     d = d[orderind];
>     #Determine the parameter estimates
>     numerator.g = 1;
>     denominator = 1;
>     numerator.d = 1;
>     for (k in 2:G) {
>       numerator.g = numerator.g + (g[k]/g[1]) * exp(LL[k]-LL[1]);
>       denominator = denominator + exp(LL[k]-LL[1]);
>       numerator.d = numerator.d + (d[k]/d[1]) * exp(LL[k]-LL[1]);
>     }
>     g.star[i] = g[1]*(numerator.g/denominator);
>     d.star[i] = d[1]*(numerator.d/denominator);
>   }
>   adjust <- rbind(g.star,d.star)
>   rownames(adjust) <- c("g.star","d.star")
>   adjust
> }