[Bioc-devel] Bug in rma: zero variance of probeset summaries across arrays
Wolfgang Huber
huber at ebi.ac.uk
Tue Feb 3 20:14:57 CET 2009
Thank you, Holger, Jim, Ben,
(1) this is indeed a peculiar feature of the medianpolish part of RMA. I
apologize to Ben for using the harsh word "bug" inappropriately in this
context. I was misled by observations that using different normalisation
or background correction methods seemed to "fix" the problem, and had
forgotten about those previous threads. Sorry.
I wonder whether anyone has done research on how to improve on
medianpolish for small datasets and why people haven't adapted it.
(2) One further potential problem remains, and I hope that bringing this
up might help to make sure that rma (which is obviously a core component
of Bioconductor) is robust even under the most adverse circumstances
(e.g. myself :). With the script below, I get a core dump. This is on a
freshly checked out and built R (from today) and all packages then
installed from scratch (sessionInfo as in previous mail below). It does
not seem to happen on R-2.8.1. and Bioc release 2.3., and even on R-dev,
whether it happens seems to depend on the history of the R session. One
hypothesis is that this could be related to the "destructive=TRUE"
parameter of rma?
Can anyone reproduce this?
Best wishes
Wolfgang
library("affy")
library("genefilter")
options(error=recover)
files = c("PP_T1_1.CEL", "PP_T1_2.CEL", "PP_T2_1.CEL", "PP_T2_2c.CEL")
#celdir = tempdir()
#for(f in files)
# download.file(file.path("http://www.ebi.ac.uk/~huber/pub", f),
# file.path(celdir, f))
celdir="."
a = ReadAffy(filenames=file.path(celdir, files))
## Define a CDF with one probeset per probe
testcdf = new.env(hash=TRUE)
for(i in seq_len(nrow(exprs(a))))
assign(paste(i), matrix(i, ncol=1, nrow=1), envir=testcdf)
a at cdfName = "testcdf"
x = rma(a)
----------------------
Background correcting
Normalizing
Calculating Expression
Error in `colnames<-`(`*tmp*`, value = c("PP_T1_1.CEL", "PP_T1_2.CEL", :
attempt to set colnames on object with less than two dimensions
Enter a frame number, or 0 to exit
1: source("rma2.R")
2: eval.with.vis(ei, envir)
3: eval.with.vis(expr, envir, enclos)
4: rma(a)
5: `colnames<-`(`*tmp*`, value = c("PP_T1_1.CEL", "PP_T1_2.CEL",
"PP_T2_1.CEL"
Selection: 4
Called from: eval(expr, envir, enclos)
Browse[1]> ls()
[1] "*tmp*" "background" "bgversion" "cols" "destructive"
[6] "exprs" "ngenes" "normalize" "object" "pNList"
[11] "rows" "subset" "verbose"
Browse[1]> str(exprs)
........ long dump of text ....
496: stop("attempt to set colnames on object with less than two dimensions")
497: `colnames<-`(`*tmp*`, value = c("PP_T1_1.CEL", "PP_T1_2.CEL",
"PP_T2_1.CEL", "PP_T2_2c.CEL"))
498: rma(a)
499: eval.with.vis(expr, envir, enclos)
500: eval.with.vis(ei, envir)
501: source("rma2.R")
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection:
James W. MacDonald wrote:
> Hi Holger,
>
> Holger Schwender wrote:
>> Hi Wolfgang,
>>
>> the difference between the two outputs x1 and x2 are due to that in
>> x1 the probe intensities are background-corrected first and then
>> normalized, while in x2 they are first normalized and then
>> background-corrected.
>>
>> The constant values are a feature of median polish: If median polish
>> is applied to four or six samples, then this can lead to constant
>> values for some of the probe sets. This has been mentioned years ago
>> by Jim (I couldn't find his original email on this topic in the
>> mailing list archive).
>
> Actually this only occurs with a small, odd number of samples, not even.
> Well, it could happen for a small number of probesets if the n/2 and n/2
> + 1 ordered values were identical, but it is much more common in the
> case of 3 or 5 samples.
>
> Best,
>
> Jim
>
>
>>
>> Best, Holger
>>
>>
>> -------- Original-Nachricht --------
>>> Datum: Tue, 03 Feb 2009 15:38:06 +0100 Von: Wolfgang Huber
>>> <huber at ebi.ac.uk> An: Bioconductor Developers
>>> <bioc-devel at stat.math.ethz.ch> Betreff: [Bioc-devel] Bug in rma:
>>> zero variance of probeset summaries across arrays
>>
>>> Hi,
>>>
>>> I am getting puzzling results from the rma function in the affy
>>> package, and furthermore they are different depending on whether
>>> quantile normalisation is called within rma or explicitely. The
>>> puzzling part is that some probesets, in a perfectly ordinary
>>> experiment with 4 different arrays, are summarized with exactly
>>> identical values across all arrays (i.e. zero variance); while the
>>> underlying probe level data for these probesets is variable and
>>> sits right in the middle of the dynamic range (i.e. not at any of
>>> the extremes).
>>>
>>> The code: ---------
>>>
>>> library("affy") library("genefilter") library("preprocessCore")
>>> options(error=recover)
>>>
>>> files = c("PP_T1_1.CEL", "PP_T1_2.CEL", "PP_T2_1.CEL",
>>> "PP_T2_2c.CEL")
>>>
>>> for(f in files)
>>> download.file(file.path("http://www.ebi.ac.uk/~huber/pub", f),
>>> file.path(tempdir(), f))
>>>
>>> a = ReadAffy(filenames=file.path(tempdir(), files))
>>>
>>> x1 = rma(a)
>>>
>>> aq = a exprs(aq) = normalize.quantiles(exprs(aq)) x2 = rma(aq,
>>> normalize=FALSE)
>>>
>>> probeVars = function(x) { r = rowSds(exprs(x)) ps =
>>> names(which(r==0)) cat(sprintf("%s: %d probes with zero
>>> variance.\n", deparse(substitute(x)), length(ps))) return(ps) }
>>>
>>> probeVars(a) ps = probeVars(x1) probeVars(x2)
>>>
>>> print(exprs(a)[yeast2cdf[[ps[1]]][,1],])
>>>
>>> -----------------------------------------------------------------------------
>>>
>>>
>>>
>>> The output: ----------- Background correcting Normalizing Calculating
>>> Expression Background correcting Calculating Expression
>>>
>>>
>>> a: 0 probes with zero variance. x1: 3 probes with zero variance. x2:
>>> 0 probes with zero variance.
>>>
>>> PP_T1_1.CEL PP_T1_2.CEL PP_T2_1.CEL PP_T2_2c.CEL 35244 185
>>> 236 269 253 198153 226 314
>>> 320 213 48500 375 370 408
>>> 373 109094 125 157 201 175 144686
>>> 107 131 153 100 214110 181
>>> 296 307 234 47302 355 430
>>> 397 282 108656 117 130 150
>>> 118 147528 158 144 159 135 237691
>>> 208 244 253 169 229910 94
>>> 138 147 133
>>>
>>>> sessionInfo()
>>> R version 2.9.0 Under development (unstable) (2009-02-03 r47829)
>>> x86_64-unknown-linux-gnu
>>>
>>> locale:
>>> LC_CTYPE=it_IT.UTF-8;LC_NUMERIC=C;LC_TIME=it_IT.UTF-8;LC_COLLATE=it_IT.UTF-8;LC_MONETARY=C;LC_MESSAGES=it_IT.UTF-8;LC_PAPER=it_IT.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=it_IT.UTF-8;LC_IDENTIFICATION=C
>>>
>>>
>>>
>>> attached base packages: [1] stats graphics grDevices datasets
>>> utils methods base
>>>
>>> other attached packages: [1] yeast2cdf_2.3.0
>>> preprocessCore_1.5.3 genefilter_1.23.1 [4] affy_1.21.7
>>> Biobase_2.3.10 fortunes_1.3-6
>>>
>>> loaded via a namespace (and not attached): [1] affyio_1.11.3
>>> annotate_1.21.3 AnnotationDbi_1.5.14 [4] DBI_0.2-4
>>> RSQLite_0.7-1 splines_2.9.0 [7] survival_2.34-1
>>> tools_2.9.0 xtable_1.5-4
>>>
>>>
>>>
>>> Best wishes Wolfgang
>>>
>>>
>>> PS While trying to dissect this, I also experimented with
>>> constructing special CDF environments (e.g. one that had one
>>> 'probeset' for each probe), and I noted that repeatedly the C code
>>> of rma caused core dumps. Perhaps a better response to unexpected
>>> shapes of CDF environments would be to issue an error message. I
>>> can provide details if interested.
>>>
>>> ---------------------------------------------------- Wolfgang
>>> Huber, EMBL-EBI, http://www.ebi.ac.uk/huber
>>>
>>> _______________________________________________
>>> Bioc-devel at stat.math.ethz.ch mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>> --
>>
>> _______________________________________________
>> Bioc-devel at stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
--
----------------------------------------------------
Wolfgang Huber EMBL-EBI http://www.ebi.ac.uk/huber
More information about the Bioc-devel
mailing list