[Bioc-devel] Bug in rma: zero variance of probeset summaries across arrays

Henrik Bengtsson hb at stat.berkeley.edu
Tue Feb 3 20:26:38 CET 2009


Hi,

On Tue, Feb 3, 2009 at 11:14 AM, Wolfgang Huber <huber at ebi.ac.uk> wrote:
> Thank you, Holger, Jim, Ben,
>
> (1) this is indeed a peculiar feature of the medianpolish part of RMA. I
> apologize to Ben for using the harsh word "bug" inappropriately in this
> context. I was misled by observations that using different normalisation or
> background correction methods seemed to "fix" the problem, and had forgotten
> about those previous threads. Sorry.
>
> I wonder whether anyone has done research on how to improve on medianpolish
> for small datasets and why people haven't adapted it.

For the purpose of fitting the RMA-style log-additive model, I'd say
that Ben's robust estimators implemented in preprocessCore are much
better (and more flexible, e.g. support weights) than using median
polish.  See

   help("rcModelPLM", package="preprocessCore")

/Henrik

>
> (2) One further potential problem remains, and I hope that bringing this up
> might help to make sure that rma (which is obviously a core component of
> Bioconductor) is robust even under the most adverse circumstances (e.g.
> myself :). With the script below, I get a core dump. This is on a freshly
> checked out and built R (from today) and all packages then installed from
> scratch (sessionInfo as in previous mail below). It does not seem to happen
> on R-2.8.1. and Bioc release 2.3., and even on R-dev, whether it happens
> seems to depend on the history of the R session. One hypothesis is that this
> could be related to the "destructive=TRUE" parameter of rma?
>
> Can anyone reproduce this?
>
>  Best wishes
>        Wolfgang
>
>
> library("affy")
> library("genefilter")
> options(error=recover)
>
> files = c("PP_T1_1.CEL", "PP_T1_2.CEL", "PP_T2_1.CEL", "PP_T2_2c.CEL")
>
> #celdir = tempdir()
> #for(f in files)
> #  download.file(file.path("http://www.ebi.ac.uk/~huber/pub", f),
> #                file.path(celdir, f))
> celdir="."
>
> a = ReadAffy(filenames=file.path(celdir, files))
>
> ## Define a CDF with one probeset per probe
> testcdf = new.env(hash=TRUE)
> for(i in seq_len(nrow(exprs(a))))
>    assign(paste(i), matrix(i, ncol=1, nrow=1), envir=testcdf)
> a at cdfName = "testcdf"
>
> x = rma(a)
>
>
> ----------------------
>
> Background correcting
> Normalizing
> Calculating Expression
> Error in `colnames<-`(`*tmp*`, value = c("PP_T1_1.CEL", "PP_T1_2.CEL",  :
>  attempt to set colnames on object with less than two dimensions
>
> Enter a frame number, or 0 to exit
>
> 1: source("rma2.R")
> 2: eval.with.vis(ei, envir)
> 3: eval.with.vis(expr, envir, enclos)
> 4: rma(a)
> 5: `colnames<-`(`*tmp*`, value = c("PP_T1_1.CEL", "PP_T1_2.CEL",
> "PP_T2_1.CEL"
>
> Selection:  4
> Called from: eval(expr, envir, enclos)
> Browse[1]>  ls()
>  [1] "*tmp*"       "background"  "bgversion"   "cols"        "destructive"
>  [6] "exprs"       "ngenes"      "normalize"   "object"      "pNList"
> [11] "rows"        "subset"      "verbose"
> Browse[1]> str(exprs)
>
> ........ long dump of text ....
>
>
> 496: stop("attempt to set colnames on object with less than two dimensions")
> 497: `colnames<-`(`*tmp*`, value = c("PP_T1_1.CEL", "PP_T1_2.CEL",
> "PP_T2_1.CEL", "PP_T2_2c.CEL"))
> 498: rma(a)
> 499: eval.with.vis(expr, envir, enclos)
> 500: eval.with.vis(ei, envir)
> 501: source("rma2.R")
>
> Possible actions:
> 1: abort (with core dump, if enabled)
> 2: normal R exit
> 3: exit R without saving workspace
> 4: exit R saving workspace
> Selection:
>
>
> James W. MacDonald wrote:
>>
>> Hi Holger,
>>
>> Holger Schwender wrote:
>>>
>>> Hi Wolfgang,
>>>
>>> the difference between the two outputs x1 and x2 are due to that in
>>> x1 the probe intensities are background-corrected first and then
>>> normalized, while in x2 they are first normalized and then
>>> background-corrected.
>>>
>>> The constant values are a feature of median polish: If median polish
>>> is applied to four or six samples, then this can lead to constant
>>> values for some of the probe sets. This has been mentioned years ago
>>> by Jim (I couldn't find his original email on this topic in the
>>> mailing list archive).
>>
>> Actually this only occurs with a small, odd number of samples, not even.
>> Well, it could happen for a small number of probesets if the n/2 and n/2 + 1
>> ordered values were identical, but it is much more common in the case of 3
>> or 5 samples.
>>
>> Best,
>>
>> Jim
>>
>>
>>>
>>> Best, Holger
>>>
>>>
>>> -------- Original-Nachricht --------
>>>>
>>>> Datum: Tue, 03 Feb 2009 15:38:06 +0100 Von: Wolfgang Huber
>>>> <huber at ebi.ac.uk> An: Bioconductor Developers
>>>> <bioc-devel at stat.math.ethz.ch> Betreff: [Bioc-devel] Bug in rma:
>>>> zero variance of probeset summaries across    arrays
>>>
>>>> Hi,
>>>>
>>>> I am getting puzzling results from the rma function in the affy
>>>> package, and furthermore they are different depending on whether
>>>> quantile normalisation is called within rma or explicitely. The
>>>> puzzling part is that some probesets, in a perfectly ordinary
>>>> experiment with 4 different arrays, are summarized with exactly
>>>> identical values across all arrays (i.e. zero variance); while the
>>>> underlying probe level data for these probesets is variable and
>>>> sits right in the middle of the dynamic range (i.e. not at any of
>>>> the extremes).
>>>>
>>>> The code: ---------
>>>>
>>>> library("affy") library("genefilter") library("preprocessCore")
>>>> options(error=recover)
>>>>
>>>> files = c("PP_T1_1.CEL", "PP_T1_2.CEL", "PP_T2_1.CEL",
>>>> "PP_T2_2c.CEL")
>>>>
>>>> for(f in files)
>>>> download.file(file.path("http://www.ebi.ac.uk/~huber/pub", f),
>>>> file.path(tempdir(), f))
>>>>
>>>> a = ReadAffy(filenames=file.path(tempdir(), files))
>>>>
>>>> x1 = rma(a)
>>>>
>>>> aq = a exprs(aq) = normalize.quantiles(exprs(aq)) x2 = rma(aq,
>>>> normalize=FALSE)
>>>>
>>>> probeVars = function(x) { r = rowSds(exprs(x)) ps =
>>>> names(which(r==0)) cat(sprintf("%s: %d probes with zero
>>>> variance.\n", deparse(substitute(x)), length(ps))) return(ps) }
>>>>
>>>> probeVars(a) ps = probeVars(x1) probeVars(x2)
>>>>
>>>> print(exprs(a)[yeast2cdf[[ps[1]]][,1],])
>>>>
>>>>
>>>> -----------------------------------------------------------------------------
>>>>
>>>>
>>>> The output: ----------- Background correcting Normalizing Calculating
>>>> Expression Background correcting Calculating Expression
>>>>
>>>>
>>>> a: 0 probes with zero variance. x1: 3 probes with zero variance. x2: 0
>>>> probes with zero variance.
>>>>
>>>> PP_T1_1.CEL PP_T1_2.CEL PP_T2_1.CEL PP_T2_2c.CEL 35244          185
>>>> 236         269          253 198153         226         314
>>>> 320          213 48500          375         370         408
>>>> 373 109094         125         157         201          175 144686
>>>> 107         131         153          100 214110         181
>>>> 296         307          234 47302          355         430
>>>> 397          282 108656         117         130         150
>>>> 118 147528         158         144         159          135 237691
>>>> 208         244         253          169 229910          94
>>>> 138         147          133
>>>>
>>>>> sessionInfo()
>>>>
>>>> R version 2.9.0 Under development (unstable) (2009-02-03 r47829)
>>>> x86_64-unknown-linux-gnu
>>>>
>>>> locale:
>>>> LC_CTYPE=it_IT.UTF-8;LC_NUMERIC=C;LC_TIME=it_IT.UTF-8;LC_COLLATE=it_IT.UTF-8;LC_MONETARY=C;LC_MESSAGES=it_IT.UTF-8;LC_PAPER=it_IT.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=it_IT.UTF-8;LC_IDENTIFICATION=C
>>>>
>>>>
>>>> attached base packages: [1] stats     graphics  grDevices datasets
>>>> utils     methods   base
>>>>
>>>> other attached packages: [1] yeast2cdf_2.3.0
>>>> preprocessCore_1.5.3 genefilter_1.23.1 [4] affy_1.21.7
>>>> Biobase_2.3.10       fortunes_1.3-6
>>>>
>>>> loaded via a namespace (and not attached): [1] affyio_1.11.3
>>>> annotate_1.21.3      AnnotationDbi_1.5.14 [4] DBI_0.2-4
>>>> RSQLite_0.7-1        splines_2.9.0 [7] survival_2.34-1
>>>> tools_2.9.0          xtable_1.5-4
>>>>
>>>>
>>>>
>>>> Best wishes Wolfgang
>>>>
>>>>
>>>> PS While trying to dissect this, I also experimented with
>>>> constructing special CDF environments (e.g. one that had one
>>>> 'probeset' for each probe), and I noted that repeatedly the C code
>>>> of rma caused core dumps. Perhaps a better response to unexpected
>>>> shapes of CDF environments would be to issue an error message. I
>>>> can provide details if interested.
>>>>
>>>> ---------------------------------------------------- Wolfgang
>>>> Huber, EMBL-EBI, http://www.ebi.ac.uk/huber
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at stat.math.ethz.ch mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>> --
>>>
>>> _______________________________________________
>>> Bioc-devel at stat.math.ethz.ch mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
>
> --
> ----------------------------------------------------
> Wolfgang Huber  EMBL-EBI  http://www.ebi.ac.uk/huber
>
> _______________________________________________
> Bioc-devel at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>



More information about the Bioc-devel mailing list