[Bioc-devel] [BioC] Peculiar behaviour of normalize.quantiles (in affy, preprocessCore) if there are NA data

Seth Falcon sfalcon at fhcrc.org
Wed Jul 11 16:04:48 CEST 2007


[moved to bioc-devel, where this should have started I think]

Ben Bolstad <bmb at bmbolstad.com> writes:

> Wolfgang,
>
> The code in preprocessCore for quantile normalization shows its legacy
> being that it was developed around probe-level Affymetrix data straight
> from CEL files where NA values are not to be expected. There may or may
> not be comments to that effect in the C code documentation (actually
> there is further down in the qnorm.c file for a slight variation on the
> implementation).
>
> If you are willing to make the assumption that the missing data
> mechanism is "missing at random" then I think the fix is fairly trivial,
> just estimate the distribution using the non-missing data. If it is
> instead driven by say a truncation mechanism a different fix would be
> needed.
>
> In either case I don't think the current situation is desirable and
> should be fixed.

How about:

1. Let's add code to check for and raise an error if any NA's are
   found.  This should be easy and can be done quickly.

2. Then we could consider adding an argument that allows NA's and
   handles things under the missing at random assumption, along with
   documentation.

+ seth



>
> Best,
>
> Ben
>
>
>
>
>
> On Tue, 2007-07-10 at 18:35 +0100, Wolfgang Huber wrote:
>> Hi all,
>> 
>> I noted a peculiar result from using quantile normalisation on a data
>> matrix that contained NA values. It creates a rather artifactual-looking
>> distribution of the resulting data, and I wonder whether:
>> - this is desired,
>> - if not, how it can be fixed,
>> - in either case, whether this is a point of general interest for people
>> that interpret distributions of their e.g. microarray data.
>> 
>> Here is some example code to reproduce:
>> 
>> 
>> 
>> library("geneplotter")
>> library("preprocessCore")
>> 
>> set.seed(0xbeef)
>> 
>> x = matrix(as.numeric(NA), nrow=20000, ncol=2)
>> for(i in 1:ncol(x))
>>  x[,i] = c(rnorm(10000), runif(10000)*10)
>> x[ sample(nrow(x), 1000), ncol(x)] = NA
>> 
>> qx = normalize.quantiles(x)
>> 
>> par(mfrow=c(2,2))
>> 
>> for(what in c("x", "qx"))
>>   for(i in 1:2)
>>     hist(get(what)[,i], breaks=seq(-5,10, length=75),
>>          main=sprintf("%s[,%d]", what, i),
>>          col="orange", xlab="")
>> 
>> 
>> 
>> 
>> 
>> The resulting plot is here
>> http://www.ebi.ac.uk/~huber/quantilenormalisation/normalize.quantiles.png
>> 
>> I noted in the implementation in preprocessCore/src/qnorm.c that no
>> special consideration is made for NA values, maybe does this confuse the
>> algorithm?
>> 
>> 
>> R version 2.6.0 Under development (unstable) (2007-07-10 r42165)
>> x86_64-unknown-linux-gnu
>> 
>> locale:
>> LC_CTYPE=en_GB.UTF-8;LC_NUMERIC=C;LC_TIME=en_GB.UTF-8;LC_COLLATE=en_GB.UTF-8;LC_MONETARY=en_GB.UTF-8;LC_MESSAGES=en_GB.UTF-8;LC_PAPER=en_GB.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_GB.UTF-8;LC_IDENTIFICATION=C
>> 
>> attached base packages:
>> [1] tools     stats     graphics  grDevices datasets  utils     methods
>> [8] base
>> 
>> other attached packages:
>> [1] preprocessCore_0.99.8 geneplotter_1.15.1    lattice_0.16-1
>> [4] annotate_1.15.2       AnnotationDbi_0.0.78  RSQLite_0.5-4
>> [7] DBI_0.2-3             Biobase_1.15.17       fortunes_1.3-3
>> 
>> loaded via a namespace (and not attached):
>> [1] grid_2.6.0         KernSmooth_2.22-20 RColorBrewer_0.2-3
>> >
>> 
>> 
>> Best wishes
>>  Wolfgang
>> 
>> ------------------------------------------------------------------
>> Wolfgang Huber  EBI/EMBL  Cambridge UK  http://www.ebi.ac.uk/huber
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center
http://bioconductor.org



More information about the Bioc-devel mailing list