[BioC] quantile normalization
Gordon Smyth
smyth at wehi.edu.au
Fri Jun 6 00:28:40 MEST 2003
I had a play around with 'normalizeQuantiles' to see how easy it is to give
the "correct" treatment of ties. It's not too difficult, but there is a
performance penalty, so I've made it a non-default option.
> A <- cbind(c(1,2,3,4,5),c(2,4,4,6,7))
> normalizeQuantiles(A,ties=TRUE)
[,1] [,2]
[1,] 1.5 1.50
[2,] 3.0 3.25
[3,] 3.5 3.25
[4,] 5.0 5.00
[5,] 6.0 6.00
> A <- cbind(c(1,2,3,4,5),c(2,4,4,4,7))
> normalizeQuantiles(A,ties=TRUE)
[,1] [,2]
[1,] 1.5 1.5
[2,] 3.0 3.5
[3,] 3.5 3.5
[4,] 4.0 3.5
[5,] 6.0 6.0
The 'normalizeQuantiles' function also handles missing values, assumed
'missing at random':
> A
[,1] [,2]
[1,] 1 2
[2,] NA 4
[3,] 3 4
[4,] 4 6
[5,] 5 7
> normalizeQuantiles(A,ties=TRUE)
[,1] [,2]
[1,] 1.500000 1.500
[2,] NA 3.500
[3,] 3.416667 3.500
[4,] 4.666667 5.125
[5,] 6.000000 6.000
Here are some timings. Handling the ties carefully increases the time taken
several fold:
> A <- matrix(rnorm(30000*100),30000,100) # 100 arrays with 30,000 spots each
> system.time(B <- normalizeQuantiles(A))
[1] 3.62 0.00 3.98 NA NA
> system.time(B <- normalizeQuantiles(A,ties=TRUE))
[1] 12.25 0.01 12.39 NA NA
> system.time(B <- normalize.quantiles(A))
[1] 6.42 0.07 6.50 NA NA
As pointed out by Ben, normalize.quantiles already gives the "correct"
treatment of ties when the number of ties is odd.
Cheers
Gordon
At 11:59 AM 4/06/2003, Gordon Smyth wrote:
>A good point! Although probably not of great practical importance for most
>microarray experiments.
>
>I think the correct in-principle behavior of quantile normalization is
>clear. In the case of ties, one should in principle set each tied
>log-intensity to the average of the corresponding pooled quantiles. There
>are two quantile normalization routines in bioconductor at the moment (one
>intended for cDNA arrays and one for affy) and neither does exactly the
>right thing. limma breaks ties using by preserving index order while affy
>treats all tied values as having the minimum of the tied ranks:
>
> > A
> [,1] [,2]
>[1,] 1 2
>[2,] 2 4
>[3,] 3 4
>[4,] 4 6
> > library(limma)
> > normalizeQuantiles(A)
> [,1] [,2]
>[1,] 1.5 1.5
>[2,] 3.0 3.0
>[3,] 3.5 3.5
>[4,] 5.0 5.0
> > library(affy)
> > normalize.quantiles(A)
> [,1] [,2]
>[1,] 1.5 1.5
>[2,] 3.0 3.0
>[3,] 3.5 3.0
>[4,] 5.0 5.0
>
>In principle, I think the correct quantile normalized matrix should be
> [,1] [,2]
>[1,] 1.5 1.5
>[2,] 3.0 3.25
>[3,] 3.5 3.25
>[4,] 5.0 5.0
>
>Cheers
>Gordon
>
>At 02:54 AM 4/06/2003, Martino Barenco wrote:
>>Hi,
>>
>>My understanding of quantile normalization is that values for several
>>data sets are ranked, then the average per rank is taken and is
>>reattributed to each data set according to the original rank (hope this
>>makes sense). My question is: how does one deal when ties occur?
>>- One possibility would be to force a tied value to be greater than the
>>other (using the "order" command instead of "rank"). Even though it would
>>not make a huge difference it is a bit arbitrary.
>>- Another possibility is to "force" the ties across all data sets, quite
>>arbitrary as well.
>>- or maybe a combination of both, but it looks like a nightmare to program!
>>
>>So, what should be done? Also, is there a method such as "normalize" that
>>acts on exprSet rather than AffyBatch?
>>
>>Thanks
>>
>>Martino
>>---------------------------------------
>>Martino Barenco
>>CoMPLEX
>>4, Stephenson Way
>>London NW1 2HE
>>Tel.: +44 20 7679 5088
>>Fax.: +44 20 7383 5519
>>Email: m.barenco at ucl.ac.uk
More information about the Bioconductor
mailing list