[BioC] quantile normalization
    Gordon Smyth 
    smyth at wehi.edu.au
       
    Fri Jun  6 00:28:40 MEST 2003
    
    
  
I had a play around with 'normalizeQuantiles' to see how easy it is to give 
the "correct" treatment of ties. It's not too difficult, but there is a 
performance penalty, so I've made it a non-default option.
 > A <- cbind(c(1,2,3,4,5),c(2,4,4,6,7))
 > normalizeQuantiles(A,ties=TRUE)
      [,1] [,2]
[1,]  1.5 1.50
[2,]  3.0 3.25
[3,]  3.5 3.25
[4,]  5.0 5.00
[5,]  6.0 6.00
 > A <- cbind(c(1,2,3,4,5),c(2,4,4,4,7))
 > normalizeQuantiles(A,ties=TRUE)
      [,1] [,2]
[1,]  1.5  1.5
[2,]  3.0  3.5
[3,]  3.5  3.5
[4,]  4.0  3.5
[5,]  6.0  6.0
The 'normalizeQuantiles' function also handles missing values, assumed 
'missing at random':
 > A
      [,1] [,2]
[1,]    1    2
[2,]   NA    4
[3,]    3    4
[4,]    4    6
[5,]    5    7
 > normalizeQuantiles(A,ties=TRUE)
          [,1]  [,2]
[1,] 1.500000 1.500
[2,]       NA 3.500
[3,] 3.416667 3.500
[4,] 4.666667 5.125
[5,] 6.000000 6.000
Here are some timings. Handling the ties carefully increases the time taken 
several fold:
 > A <- matrix(rnorm(30000*100),30000,100)  # 100 arrays with 30,000 spots each
 > system.time(B <- normalizeQuantiles(A))
[1] 3.62 0.00 3.98   NA   NA
 > system.time(B <- normalizeQuantiles(A,ties=TRUE))
[1] 12.25  0.01 12.39    NA    NA
 > system.time(B <- normalize.quantiles(A))
[1] 6.42 0.07 6.50   NA   NA
As pointed out by Ben, normalize.quantiles already gives the "correct" 
treatment of ties when the number of ties is odd.
Cheers
Gordon
At 11:59 AM 4/06/2003, Gordon Smyth wrote:
>A good point! Although probably not of great practical importance for most 
>microarray experiments.
>
>I think the correct in-principle behavior of quantile normalization is 
>clear. In the case of ties, one should in principle set each tied 
>log-intensity to the average of the corresponding pooled quantiles. There 
>are two quantile normalization routines in bioconductor at the moment (one 
>intended for cDNA arrays and one for affy) and neither does exactly the 
>right thing. limma breaks ties using by preserving index order while affy 
>treats all tied values as having the minimum of the tied ranks:
>
> > A
>      [,1] [,2]
>[1,]    1    2
>[2,]    2    4
>[3,]    3    4
>[4,]    4    6
> > library(limma)
> > normalizeQuantiles(A)
>      [,1] [,2]
>[1,]  1.5  1.5
>[2,]  3.0  3.0
>[3,]  3.5  3.5
>[4,]  5.0  5.0
> > library(affy)
> > normalize.quantiles(A)
>      [,1] [,2]
>[1,]  1.5  1.5
>[2,]  3.0  3.0
>[3,]  3.5  3.0
>[4,]  5.0  5.0
>
>In principle, I think the correct quantile normalized matrix should be
>      [,1] [,2]
>[1,]  1.5  1.5
>[2,]  3.0  3.25
>[3,]  3.5  3.25
>[4,]  5.0  5.0
>
>Cheers
>Gordon
>
>At 02:54 AM 4/06/2003, Martino Barenco wrote:
>>Hi,
>>
>>My understanding of quantile normalization is that values for several 
>>data sets are ranked, then the average per rank is taken and is 
>>reattributed to each data set according to the original rank (hope this 
>>makes sense). My question is: how does one deal when ties occur?
>>- One possibility would be to force a tied value to be greater than the 
>>other (using the "order" command instead of "rank"). Even though it would 
>>not make a huge difference it is a bit arbitrary.
>>- Another possibility is to "force" the ties across all data sets, quite 
>>arbitrary as well.
>>- or maybe a combination of both, but it looks like a nightmare to program!
>>
>>So, what should be done? Also, is there a method such as "normalize" that 
>>acts on exprSet rather than AffyBatch?
>>
>>Thanks
>>
>>Martino
>>---------------------------------------
>>Martino Barenco
>>CoMPLEX
>>4, Stephenson Way
>>London NW1 2HE
>>Tel.: +44 20 7679 5088
>>Fax.: +44 20 7383 5519
>>Email: m.barenco at ucl.ac.uk
    
    
More information about the Bioconductor
mailing list