[Bioc-devel] [BioC] Peculiar behaviour of normalize.quantiles (in affy, preprocessCore) if there are NA data

Wolfgang Huber huber at ebi.ac.uk
Wed Jul 11 19:02:04 CEST 2007


Hi Seth & Ben,

thanks for your clarifying comments!

> [moved to bioc-devel, where this should have started I think]

Sorry if I have been stepping on feet... the reason for posting to the
bioc user list was that more than once I have (sadly) seen people
looking at histogrammes such as that of qx shown in my previous post,
and using the suggested "cutoff" e.g. to discriminate between expressed
and un-expressed genes, and the like. I hope that this does not sound to
presumptuous, but I think it is a good thing to educate users to
critically assess such results.

Btw, normalizeQuantiles from the limma package appears to deal with NA
values more gracefully (but it is written in R, hence slower). I think
it assumes that the missingness mechanism is random.

An ideal version of Bioconductor, of course, would also minimize the
potential to confuse users by differences in results and performance of
limma:normalizeQuantiles and affy/preprocessCore:normalize.quantiles.

	Best wishes
	Wolfgang



> 
> Ben Bolstad <bmb at bmbolstad.com> writes:
> 
>> Wolfgang,
>>
>> The code in preprocessCore for quantile normalization shows its legacy
>> being that it was developed around probe-level Affymetrix data straight
>> from CEL files where NA values are not to be expected. There may or may
>> not be comments to that effect in the C code documentation (actually
>> there is further down in the qnorm.c file for a slight variation on the
>> implementation).
>>
>> If you are willing to make the assumption that the missing data
>> mechanism is "missing at random" then I think the fix is fairly trivial,
>> just estimate the distribution using the non-missing data. If it is
>> instead driven by say a truncation mechanism a different fix would be
>> needed.
>>
>> In either case I don't think the current situation is desirable and
>> should be fixed.
> 
> How about:
> 
> 1. Let's add code to check for and raise an error if any NA's are
>    found.  This should be easy and can be done quickly.
> 
> 2. Then we could consider adding an argument that allows NA's and
>    handles things under the missing at random assumption, along with
>    documentation.
> 
> + seth
> 
> 
> 
>> Best,
>>
>> Ben
>>
>>
>>
>>
>>
>> On Tue, 2007-07-10 at 18:35 +0100, Wolfgang Huber wrote:
>>> Hi all,
>>>
>>> I noted a peculiar result from using quantile normalisation on a data
>>> matrix that contained NA values. It creates a rather artifactual-looking
>>> distribution of the resulting data, and I wonder whether:
>>> - this is desired,
>>> - if not, how it can be fixed,
>>> - in either case, whether this is a point of general interest for people
>>> that interpret distributions of their e.g. microarray data.
>>>
>>> Here is some example code to reproduce:
>>>
>>>
>>>
>>> library("geneplotter")
>>> library("preprocessCore")
>>>
>>> set.seed(0xbeef)
>>>
>>> x = matrix(as.numeric(NA), nrow=20000, ncol=2)
>>> for(i in 1:ncol(x))
>>>  x[,i] = c(rnorm(10000), runif(10000)*10)
>>> x[ sample(nrow(x), 1000), ncol(x)] = NA
>>>
>>> qx = normalize.quantiles(x)
>>>
>>> par(mfrow=c(2,2))
>>>
>>> for(what in c("x", "qx"))
>>>   for(i in 1:2)
>>>     hist(get(what)[,i], breaks=seq(-5,10, length=75),
>>>          main=sprintf("%s[,%d]", what, i),
>>>          col="orange", xlab="")
>>>
>>>
>>>
>>>
>>>
>>> The resulting plot is here
>>> http://www.ebi.ac.uk/~huber/quantilenormalisation/normalize.quantiles.png
>>>
>>> I noted in the implementation in preprocessCore/src/qnorm.c that no
>>> special consideration is made for NA values, maybe does this confuse the
>>> algorithm?
>>>
>>>
>>> R version 2.6.0 Under development (unstable) (2007-07-10 r42165)
>>> x86_64-unknown-linux-gnu
>>>
>>> locale:
>>> LC_CTYPE=en_GB.UTF-8;LC_NUMERIC=C;LC_TIME=en_GB.UTF-8;LC_COLLATE=en_GB.UTF-8;LC_MONETARY=en_GB.UTF-8;LC_MESSAGES=en_GB.UTF-8;LC_PAPER=en_GB.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_GB.UTF-8;LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] tools     stats     graphics  grDevices datasets  utils     methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] preprocessCore_0.99.8 geneplotter_1.15.1    lattice_0.16-1
>>> [4] annotate_1.15.2       AnnotationDbi_0.0.78  RSQLite_0.5-4
>>> [7] DBI_0.2-3             Biobase_1.15.17       fortunes_1.3-3
>>>
>>> loaded via a namespace (and not attached):
>>> [1] grid_2.6.0         KernSmooth_2.22-20 RColorBrewer_0.2-3
>>>
>>> Best wishes
>>>  Wolfgang
>>>
>>> ------------------------------------------------------------------
>>> Wolfgang Huber  EBI/EMBL  Cambridge UK  http://www.ebi.ac.uk/huber
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
> 


-- 
Best wishes
 Wolfgang

------------------------------------------------------------------
Wolfgang Huber  EBI/EMBL  Cambridge UK  http://www.ebi.ac.uk/huber



More information about the Bioc-devel mailing list