[BioC] Question about quantile normalization and NA value

Thu Jan 23 04:52:34 CET 2014

At least for the example matrix below, the preprocessCore normalize.quantiles() function gives the same result as the limma output shown below. I make no claim that the two are identical in other cases, nor that its treatment of NAs is better than that of any other implementation.

Best,

Ben

On Jan 22, 2014, at 7:16 PM, Gordon K Smyth <smyth at wehi.EDU.AU> wrote:

> The meaning of quantile normalization with NAs has never been agreed on in a refereed publication, as far as I know. I implemented the limma version long ago, and as far as I know it was the first implementation of quantile normalization to allow NAs.  Ben Bolstad implemented a somewhat different algorithm in the affy package.  Ben's version is now in the preprocessCore package as normalize.quantiles().
> 
> The result you have is correct according to limma's algorithm, which involves interpolating each column of non-missing values out to a full-length vector when computing the mean quantiles.  The reason the NA makes a big difference is that it changes the minimum quantile for column 2 from 16.5 to 110, a big change.  As an alternative, you might try Ben's algorithm:
> 
>   library(preprocessCore)
>   normalize.quantiles(y)
> 
> But replacing NAs with row medians would not in general be sufficient.
> 
> Best wishes
> Gordon
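
The interpolation Gordon describes above can be sketched in base R as follows. This is only an illustrative sketch under stated assumptions, not limma's source code: the function name quantile_normalize_na is hypothetical, each column is assumed to have at least two non-missing values, and ties are handled by average ranks. On the example matrix y from the original post it should give the same values as the normalizeBetweenArrays(y) output quoted below.

```r
# Hypothetical helper: a sketch of quantile normalization with NAs by
# interpolation, written for illustration only (not limma's actual code).
# Assumes every column has at least two non-missing values.
quantile_normalize_na <- function(A) {
  n <- nrow(A)
  grid <- (0:(n - 1)) / (n - 1)        # quantile positions 0, ..., 1
  S <- matrix(NA_real_, n, ncol(A))    # per-column quantiles on the full grid
  for (j in seq_len(ncol(A))) {
    s <- sort(A[, j])                  # sort() drops NAs by default
    k <- length(s)
    # Stretch the k observed values out to a full-length vector of n quantiles
    S[, j] <- approx((0:(k - 1)) / (k - 1), s, xout = grid)$y
  }
  m <- rowMeans(S)                     # mean quantiles across columns
  out <- A
  for (j in seq_len(ncol(A))) {
    ok <- !is.na(A[, j])
    k <- sum(ok)
    r <- rank(A[ok, j])                # ranks among the observed values only
    # Map each observed value to the mean quantile at its relative rank
    out[ok, j] <- approx(grid, m, xout = (r - 1) / (k - 1))$y
  }
  out                                  # NAs stay NA
}

# Example matrix from the original post
x <- matrix(c(100, 15, 200, 250, 110, 16.5, 220, 275, 120, 18, 240, 300), 4, 3)
y <- x
y[2, 2] <- NA
res <- quantile_normalize_na(y)
print(round(res, 5))
```

In this sketch the three observed values of column 2 (110, 220, 275) are stretched to four quantiles (110, 183.33, 238.33, 275), which is why the smallest mean quantile rises from 16.5 to about 47.67 once the NA is introduced.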
> 
>> Date: Tue, 21 Jan 2014 05:03:17 -0800 (PST)
>> From: H at mamba.fhcrc.org, "K [guest]" <guest at bioconductor.org>
>> To: bioconductor at r-project.org, godahajime at zoho.com
>> Subject: [BioC] Question about quantile normalization and NA value
>> 
>> 
>> Dear all,
>> 
>> I have a question about quantile normalization and NA value.
>> 
>> I'm going to normalize microarray data using "normalizeBetweenArrays", the quantile normalization function in the "limma" package.
>> I normalized data containing an NA as follows:
>> 
>>> x <- matrix(c(100,15,200,250,110,16.5,220,275,120,18,240,300),4,3)
>>> colnames(x) <- paste("Chip",1:3, sep="")
>>> rownames(x) <- c("RNA-A","RNA-B","RNA-C","RNA-D")
>>> 
>>> x
>>     Chip1 Chip2 Chip3
>> RNA-A   100 110.0   120
>> RNA-B    15  16.5    18
>> RNA-C   200 220.0   240
>> RNA-D   250 275.0   300
>>> 
>>> normalizeBetweenArrays(x)
>>     Chip1 Chip2 Chip3
>> RNA-A 110.0 110.0 110.0
>> RNA-B  16.5  16.5  16.5
>> RNA-C 220.0 220.0 220.0
>> RNA-D 275.0 275.0 275.0
>>> 
>>> y <- x
>>> y[2,2] <- NA
>>> 
>>> normalizeBetweenArrays(y)
>>         Chip1     Chip2     Chip3
>> RNA-A 134.44444  47.66667 134.44444
>> RNA-B  47.66667        NA  47.66667
>> RNA-C 226.11111 180.27778 226.11111
>> RNA-D 275.00000 275.00000 275.00000
>> 
>> 
>> I assume the normalized y is rather far from the normalized x. Does a single NA really have such a large effect?
>> Should I normalize after replacing NA with some value, such as median(x[2,],na.rm=T) ?
>> My environment is limma Version 3.16.6, R version 3.0.1.
>> 
>> Thanks
> 