[BioC] Peculiar behaviour of normalize.quantiles (in affy, preprocessCore) if there are NA data
Ben Bolstad
bmb at bmbolstad.com
Wed Jul 11 01:48:28 CEST 2007
Wolfgang,
The quantile normalization code in preprocessCore shows its legacy: it
was developed around probe-level Affymetrix data straight from CEL
files, where NA values are not to be expected. I was not sure whether
the C code documentation says so, but there is in fact a comment to that
effect further down in the qnorm.c file, on a slight variation of the
implementation.
If you are willing to assume that the missing data mechanism is
"missing at random", then I think the fix is fairly trivial: just
estimate the distribution using the non-missing data. If the
missingness is instead driven by, say, a truncation mechanism, a
different fix would be needed.
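
As a rough illustration of the "missing at random" fix, here is a
minimal sketch in plain R. It is not the preprocessCore implementation:
the function name normalize_quantiles_na and the choice of linear
interpolation via quantile() and approx() are assumptions made for this
example. Each column's quantiles are estimated from its non-missing
values on a common grid, averaged into a target distribution, and the
observed values are then mapped onto that target by rank. NA entries
stay NA.

```r
# Hypothetical helper; NOT part of preprocessCore.
# Assumes missing values are "missing at random".
normalize_quantiles_na <- function(x) {
  stopifnot(is.matrix(x), is.numeric(x), nrow(x) >= 2,
            all(colSums(!is.na(x)) > 0))  # every column needs data
  n <- nrow(x)
  # Common probability grid; with type = 7 a complete column's
  # quantiles on this grid are exactly its sorted values.
  probs <- (seq_len(n) - 1) / (n - 1)

  # Per-column quantiles estimated from the non-missing values only.
  col_q <- vapply(seq_len(ncol(x)), function(j)
    quantile(x[, j], probs = probs, na.rm = TRUE, names = FALSE,
             type = 7),
    numeric(n))
  target <- rowMeans(col_q)  # target distribution (nondecreasing)

  out <- x
  for (j in seq_len(ncol(x))) {
    obs <- !is.na(x[, j])
    m <- sum(obs)
    r <- rank(x[obs, j], ties.method = "average")
    # Rank among the observed values, rescaled to [0, 1].
    p <- if (m > 1) (r - 1) / (m - 1) else 0.5
    out[obs, j] <- approx(probs, target, xout = p, rule = 2)$y
  }
  out
}

# Usage on data simulated as in the original report
set.seed(0xbeef)
x <- cbind(c(rnorm(10000), runif(10000) * 10),
           c(rnorm(10000), runif(10000) * 10))
x[sample(nrow(x), 1000), 2] <- NA
qx <- normalize_quantiles_na(x)
```

For a matrix without NA values and without ties, this reduces to the
usual procedure of replacing each value by the mean of the sorted
columns at its rank. It does nothing sensible for truncated data.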
In either case, I don't think the current behaviour is desirable, and
it should be fixed.
Best,
Ben
On Tue, 2007-07-10 at 18:35 +0100, Wolfgang Huber wrote:
> Hi all,
>
> I noted a peculiar result from using quantile normalisation on a data
> matrix that contained NA values. It creates a rather artifactual-looking
> distribution of the resulting data, and I wonder whether:
> - this is desired,
> - if not, how it can be fixed,
> - in either case, whether this is a point of general interest for people
> who interpret the distributions of their data, e.g. microarray data.
>
> Here is some example code to reproduce:
>
>
>
> library("geneplotter")
> library("preprocessCore")
>
> set.seed(0xbeef)
>
> x = matrix(as.numeric(NA), nrow=20000, ncol=2)
> for(i in 1:ncol(x))
>   x[,i] = c(rnorm(10000), runif(10000)*10)
> x[ sample(nrow(x), 1000), ncol(x)] = NA
>
> qx = normalize.quantiles(x)
>
> par(mfrow=c(2,2))
>
> for(what in c("x", "qx"))
>   for(i in 1:2)
>     hist(get(what)[,i], breaks=seq(-5,10, length=75),
>          main=sprintf("%s[,%d]", what, i),
>          col="orange", xlab="")
>
>
>
>
>
> The resulting plot is here
> http://www.ebi.ac.uk/~huber/quantilenormalisation/normalize.quantiles.png
>
> I noted that the implementation in preprocessCore/src/qnorm.c makes no
> special provision for NA values. Maybe this confuses the algorithm?
>
>
> R version 2.6.0 Under development (unstable) (2007-07-10 r42165)
> x86_64-unknown-linux-gnu
>
> locale:
> LC_CTYPE=en_GB.UTF-8;LC_NUMERIC=C;LC_TIME=en_GB.UTF-8;LC_COLLATE=en_GB.UTF-8;LC_MONETARY=en_GB.UTF-8;LC_MESSAGES=en_GB.UTF-8;LC_PAPER=en_GB.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_GB.UTF-8;LC_IDENTIFICATION=C
>
> attached base packages:
> [1] tools stats graphics grDevices datasets utils methods
> [8] base
>
> other attached packages:
> [1] preprocessCore_0.99.8 geneplotter_1.15.1 lattice_0.16-1
> [4] annotate_1.15.2 AnnotationDbi_0.0.78 RSQLite_0.5-4
> [7] DBI_0.2-3 Biobase_1.15.17 fortunes_1.3-3
>
> loaded via a namespace (and not attached):
> [1] grid_2.6.0 KernSmooth_2.22-20 RColorBrewer_0.2-3
> >
>
>
> Best wishes
> Wolfgang
>
> ------------------------------------------------------------------
> Wolfgang Huber EBI/EMBL Cambridge UK http://www.ebi.ac.uk/huber
>