[BioC] Quantile normalization vs. data distributions

paul.boutros at utoronto.ca paul.boutros at utoronto.ca
Mon Mar 15 20:27:24 MET 2004


We've been testing something similar.  We:
a) center each array around 0 and scale to 1 SD
b) compute kernel densities for each array
c) perform all pairwise comparisons between arrays, using the area under
   both curves as a similarity metric
d) manually verify the most extreme outliers (e.g. the pairs of arrays
   with the smallest common area)

This seems to work okay for us.  As you say, any direct distributional test 
with large arrays always finds significant differences in our hands.
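Steps (a)-(c) above can be sketched as follows (not from the original exchange; a minimal Python/scipy illustration, with the function name and grid size my own choices):

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import gaussian_kde

def overlap_area(x, y, grid_size=512):
    """Shared area under two kernel-density curves (1.0 = identical)."""
    # (a) center each array around 0 and scale to 1 SD
    x = (np.asarray(x, dtype=float) - np.mean(x)) / np.std(x)
    y = (np.asarray(y, dtype=float) - np.mean(y)) / np.std(y)
    # (b) kernel-density estimates evaluated on a shared grid
    grid = np.linspace(min(x.min(), y.min()), max(x.max(), y.max()), grid_size)
    dx, dy = gaussian_kde(x)(grid), gaussian_kde(y)(grid)
    # (c) area under both curves: integrate the pointwise minimum
    return trapezoid(np.minimum(dx, dy), grid)
```

For step (d), one would compute `overlap_area` for every pair of arrays and inspect the pairs with the smallest values by hand.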

Paul

Date: Mon, 15 Mar 2004 10:04:57 -0500 
From: Naomi Altman <naomi at stat.psu.edu> 
Subject: Re: [BioC] Quantile normalization vs. data distributions 
To: "Stan Smiley" <swsmiley at genetics.utah.edu>,        "Bioconductor Mailing 
        list" <bioconductor at stat.math.ethz.ch> 
Message-ID: <6.0.0.22.2.20040314225049.01d7ffb8 at stat.psu.edu> 
Content-Type: text/plain; charset="us-ascii"; format=flowed 

This is a very good question that I have also been puzzling over.  It seems 
useless to try tests of equality of distribution such as Kolmogorov-Smirnov: 
due to the huge sample size, you would almost certainly get a significant 
result. 
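The sample-size point can be checked with a quick simulation (my own illustration, not from the original post), using scipy's two-sample KS test:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
# Two samples at a microarray-like scale (~50,000 probes) whose
# distributions differ only by a practically negligible shift of 0.05 SD.
a = rng.normal(0.00, 1.0, 50000)
b = rng.normal(0.05, 1.0, 50000)
stat, p = ks_2samp(a, b)
# The effect size (the KS statistic) is tiny, yet the p-value declares
# a "significant" difference purely because n is so large.
```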

Currently, I am using the following graphical method: 

1. I compute a kernel density estimate of the combined data of all probes 
on all the arrays. 
2. I compute a kernel density estimate of the data for each array. 
3. I plot both smooths on the same plot, and decide if they are the same. 

Looking at what I wrote above, I think it would be better in steps 1 and 2 
to background correct and center each array before combining.  It might 
also be better to reduce the data to standardized scores before combining, 
unless you think that the overall scaling is due to your "treatment effect". 
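The three steps above, with the standardization caveat folded in, might look like this (a sketch of my own in Python/scipy, not code from the original post; the function name and defaults are assumptions):

```python
import numpy as np
from scipy.stats import gaussian_kde

def density_curves(arrays, grid_size=256, standardize=True):
    """Kernel-density curves for the pooled data and for each array.

    Returns (grid, pooled_density, per_array_densities); plotting them on
    one set of axes gives the overlay described in steps 1-3.  Each array
    is reduced to standardized scores before combining; pass
    standardize=False if the overall scaling is part of the treatment
    effect.
    """
    arrays = [np.asarray(a, dtype=float) for a in arrays]
    if standardize:
        arrays = [(a - a.mean()) / a.std() for a in arrays]
    pooled = np.concatenate(arrays)
    grid = np.linspace(pooled.min(), pooled.max(), grid_size)
    # 1. density of the combined data from all probes on all arrays
    pooled_density = gaussian_kde(pooled)(grid)
    # 2. density of each array separately, on the same grid
    per_array = [gaussian_kde(a)(grid) for a in arrays]
    return grid, pooled_density, per_array
```

For step 3 one would plot the pooled curve with a heavy line and each per-array curve with a thin line on the same axes, then judge agreement by eye.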

It seems like half of what I do is ad hoc, so I always welcome any 
criticisms or suggestions. 

--Naomi Altman 

At 06:07 PM 3/11/2004, Stan Smiley wrote: 
>Greetings, 
> 
>I have been trying to find a quantitative measure to tell when the data 
>distributions between chips are 'seriously' different enough from each 
>other to violate the assumptions behind quantile normalization. I've been 
>through the archives and seen some discussion of this matter, but didn't 
>come away with a quantitative measure I could apply to my data sets to 
>assure me that it would be OK to use quantile normalization. 
> 
> 
>"Quantile normalization uses a single standard for all chips, however it 
>assumes that no serious change in distribution occurs" 
> 
>Could someone please point me in the right direction on this? 
> 
>Thanks. 
> 
>Stan Smiley 
>stan.smiley at genetics.utah.edu 
> 
>_______________________________________________ 
>Bioconductor mailing list 
>Bioconductor at stat.math.ethz.ch 
>https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor 

Naomi S. Altman                                814-865-3791 (voice) 
Associate Professor 
Bioinformatics Consulting Center 
Dept. of Statistics                              814-863-7114 (fax) 
Penn State University                         814-865-1348 (Statistics) 
University Park, PA 16802-2111
