[BioC] Peculiar behaviour of normalize.quantiles (in affy, preprocessCore) if there are NA data

Wed Jul 11 01:48:28 CEST 2007

Wolfgang,

The code in preprocessCore for quantile normalization shows its legacy
being that it was developed around probe-level Affymetrix data straight
from CEL files where NA values are not to be expected. There may or may
not be comments to that effect in the C code documentation (actually
there is further down in the qnorm.c file for a slight variation on the
implementation).

If you are willing to make the assumption that the missing data
mechanism is "missing at random" then I think the fix is fairly trivial,
just estimate the distribution using the non-missing data. If it is
instead driven by say a truncation mechanism a different fix would be
needed.

In either case I don't think the current situation is desirable and
should be fixed.

Best,

Ben

On Tue, 2007-07-10 at 18:35 +0100, Wolfgang Huber wrote:
> Hi all,
> 
> I noted a peculiar result from using quantile normalisation on a data
> matrix that contained NA values. It creates a rather artifactual-looking
> distribution of the resulting data, and I wonder whether:
> - this is desired,
> - if not, how it can be fixed,
> - in either case, whether this is a point of general interest for people
> that interpret distributions of their e.g. microarray data.
> 
> Here is some example code to reproduce:
> 
> 
> 
> library("geneplotter")
> library("preprocessCore")
> 
> set.seed(0xbeef)
> 
> x = matrix(as.numeric(NA), nrow=20000, ncol=2)
> for(i in 1:ncol(x))
>  x[,i] = c(rnorm(10000), runif(10000)*10)
> x[ sample(nrow(x), 1000), ncol(x)] = NA
> 
> qx = normalize.quantiles(x)
> 
> par(mfrow=c(2,2))
> 
> for(what in c("x", "qx"))
>   for(i in 1:2)
>     hist(get(what)[,i], breaks=seq(-5,10, length=75),
>          main=sprintf("%s[,%d]", what, i),
>          col="orange", xlab="")
> 
> 
> 
> 
> 
> The resulting plot is here
> http://www.ebi.ac.uk/~huber/quantilenormalisation/normalize.quantiles.png
> 
> I noted in the implementation in preprocessCore/src/qnorm.c that no
> special consideration is made for NA values, maybe does this confuse the
> algorithm?
> 
> 
> R version 2.6.0 Under development (unstable) (2007-07-10 r42165)
> x86_64-unknown-linux-gnu
> 
> locale:
> LC_CTYPE=en_GB.UTF-8;LC_NUMERIC=C;LC_TIME=en_GB.UTF-8;LC_COLLATE=en_GB.UTF-8;LC_MONETARY=en_GB.UTF-8;LC_MESSAGES=en_GB.UTF-8;LC_PAPER=en_GB.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_GB.UTF-8;LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] tools     stats     graphics  grDevices datasets  utils     methods
> [8] base
> 
> other attached packages:
> [1] preprocessCore_0.99.8 geneplotter_1.15.1    lattice_0.16-1
> [4] annotate_1.15.2       AnnotationDbi_0.0.78  RSQLite_0.5-4
> [7] DBI_0.2-3             Biobase_1.15.17       fortunes_1.3-3
> 
> loaded via a namespace (and not attached):
> [1] grid_2.6.0         KernSmooth_2.22-20 RColorBrewer_0.2-3
> >
> 
> 
> Best wishes
>  Wolfgang
> 
> ------------------------------------------------------------------
> Wolfgang Huber  EBI/EMBL  Cambridge UK  http://www.ebi.ac.uk/huber
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor