[BioC] LARGE Change in GCRMA expression values between versions (REPEAT QUESTION: PLEASE REPLY)

Thu Jul 10 22:40:56 CEST 2008

Hi,

Since we by default use GCRMA in our pipeline, I (also) had a look how
the newest GCRMA library (2.12.1) performed on several of our datasets.
I must admit the outcomes were rather startling to me....

Based on some testing we did few years ago we decided to use GCRMA
_only_ using the option fast=FALSE. Thus:
gcrma.data <- gcrma(data, fast=FALSE)

The reason for this was mainy because when comparing the effects of a
synthetic drug (agonist) in wild-type and receptor-null mice we noticed
a very strange p-value distribution when we used the option 'fast=TRUE'.
As expected, it peaked at the left (= because of the differential
expessed genes), but instead of gradually fading out to the right we
noticed a small bump in the middle of this distribution graph. This
became extremely prominent when we compared the effects of the drug in
the receptor null mice; we clearly observed a bump in the middle of this
curve although we expected an almost flat curve.... (The receptor this
drug binds to was knocked out in these null mice, so we _knew_ treatment
with this drug should only have a very little effect, if all, on gene
expression! [in contrast to the WT mice]).

Remarkably, this unexpected bump completely disappeared when using the
same CEL files and GCRMA with the option  'fast=FALSE'. 
See slides 3 and 4 here:
http://members.chello.nl/g.hooiveld/pics/GCRMA_evaluation.pdf
We therefore concluded that we better should not use the maximum
likelihood estimate (MLE) for background, but rather the Emperical Bayes
(EB) estimate for BG correction.

Wei Keat Lim et al. then reported about a specific problem with GCRMA,
and proposed a modification. Since as a non-statistician I could not
fully understand what his modification did, I decided to ask him about
some practical aspects (see below for full conversation). Based on this
I learned that both ML and EB estimates for bakckground are affected,
although at different levels.

So, considering the above, with the release of GCRMA v2.12.1 I decided
to analyse/compare the before mentioned arrays again. Results of this
are at slides 1 and 2 [note however that the width of the bins is
smaller than those in slides 3&4, but this does not affect the message].
Thus, I expected that the latest GCRMA would not show this strange bump
in the p-value distribution plots anymore, regardles whether the ML
(fast=TRUE) or EB (fast=FALSE) estimate for BG was used. However, this
is in contrast with what I observed, for fast=TRUE I noticed this bump
is still there, but (this is new) also that the p-values 'peaked'
completely at the right of the graph (p=0.95-1), both in the WT and
receptor null mice! Surprisingly, for fast=FALSE the p-value distr
looked as expected....

What the exact cause (and solution) is for this strange behaviour I
leave to the real knowledgeable people/experts, but I thought it is good
to share my observations.

Regards,
Guido 

E-mail conversation with Wei Keat Lim in August 2007:
Q (Guido): do you expect that the specific artifacts introduced using
the ML estimate also occur when using the EB estimate (because our data
suggest EB seems not to be affected)? 
A (Lim): Yes, the artifact introduced in GCRMA is independent of the
method used (MLE or EB). You observed a difference in your results
because the default threshold used to truncate uninformative probe is
lower in EB. The threshold is a function of 'fast', i.e.
k=6*fast+0.5*(1-fast). Hence, k=6 in MLE and k=0.5 in EB. The artifact
still exists but less obvious in EB implementation.

I then followed up on this:

Q (Guido):
> Thus, based on these analyses I conclude that when using the EB
estimate for NSB calculation, your mod doesn't really alter/improve the
expression level determination. However, in your first reply you
mentioned that your mod should also affect the "slow" gcRMA. Based on my
data, do you agree this is only marginally and therefore there is no
real need to apply your mod? Or did I overlook something and am I
completely wrong? Related to this, have you checked how the "slow"
default gcRMA is positioned in Fig8 of your paper?
A (WKL):
> Thanks a lot for sharing the information. The part in gcrma that we
suggest to modify does not relate to MLE or EB methods. The fact that
you observed a difference between them is just because they are using a
different cutoff value (k) by default. You will see a big difference if
you choose k=6 (default in MLE) in EB implementation.
> Another thing to note is that we did not expect to see big changes in
differential expression analysis, and what we showed earlier is just the
effects in gene-gene statistical dependencies. You can always use
gcrma-EB if that does not affect downstream analysis in your dataset. We
just proposed a correction that could eliminate artificial gene-gene
correlation in the implementation, independent of MLE, EB and parameter
k.
> Thanks,
> Wei Keat

This was followed by:
Q (Guido):
I just read your paper again and you clearly show that your GCRMA
modification improves the estimation of gene-gene correlations. Since
you applied the MLE method of GCRMA this effect is [likely] more
pronounced than when applying the EB implementation (for reasons you
described; it's dependent on setting of k). 
Since we normally use the EB implementation, I used your code on one of
our datasets, and found that this did not really affect the calculated
gene expression levels and subsequent statistical analysis. Therefore I
concluded that for us there is no need to use your modded GCRMA
algorithm. However, this may be a wrong conclusion since you adjusted
the GSB procedure, which is the same for the MLE or EB implementation.
In other words, it still would be wise to use your modified GCRMA code,
because I did not look at parameters that specifically 'tested' the
impact of the mod (which you did in your paper). Thus form our point of
view: your modification will at least gives outcomes identical to the
default GCRMA (EB imlementation) ("worst" case), but it will likely
(although slightly) improve our analyses as well.
Guido

A (WKL):
Guido,
Yes. You got the point :)
It's possible that you observe changes in gene-gene correlation but not
in differential expression. Imagine that there are 2 genes in 10
samples. Both of them have low expression in control samples and high
expression in treatment samples. Artificial gene-gene correlation in
these samples does not change the fact that level of expression is
different between control and treatment. The effect is just to increase
the correlation among control/treatment samples. 
Best,
Wei Keat

------------------------------------------------ 
Guido Hooiveld, PhD 
Nutrition, Metabolism & Genomics Group
Division of Human Nutrition 
Wageningen University 
Biotechnion, Bomenweg 2 
NL-6703 HD Wageningen 
the Netherlands 
tel: (+)31 317 485788 
fax: (+)31 317 483342 
internet:   http://nutrigene.4t.com
email:      guido.hooiveld at wur.nl

> -----Original Message-----
> From: bioconductor-bounces at stat.math.ethz.ch 
> [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Zhijin Wu
> Sent: 08 July 2008 18:22
> To: Richard Friedman
> Cc: Rafael A. Irizarry; bio c bioconductor
> Subject: Re: [BioC] LARGE Change in GCRMA expression values 
> between versions (REPEAT QUESTION: PLEASE REPLY)
> 
> Yes, the 2.12 version has made the change so probes without 
> affinity information do not go through GSB(gene specific 
> binding) adjustment.  So there may be a small number of 
> probesets that are affected.
> 
> best
> Zhijin
>  Richard Friedman wrote:
> > Dear Zhijin,
> >
> >     (I sent the following in May and  June to the list (worded 
> > slightly differently) but did not receive a reply.
> >  I would appreciate it if you, or someone on the list could  could 
> > clear up this problem)
> >
> >     I am noticing a large change in the absolute values of 
> intensity 
> > measurements
> >
> >     For the same probeset and and array normalized with the same 8 
> > other arrays done with GCRMA 2.10 I got 5.27 but for GCRMA 
> 2.12  I got 
> > 3.14
> >
> > Does this sound like a change that can be expected between versions 
> > 2.10 and 2.12, or does it sound as if I had made an error 
> of some kind?
> >
> > Is the change due to revising GCRMA to take the criticism  
> of Lim et 
> > al. into account?
> > Is so has version 2.12 been benchmarked against affycomp?
> >
> >
> > Thanks and best wishes,
> > Rich
> > ------------------------------------------------------------
> > Richard A. Friedman, PhD
> > Associate Research Scientist,
> > Biomedical Informatics Shared Resource Herbert Irving Comprehensive 
> > Cancer Center (HICCC) Lecturer, Department of Biomedical 
> Informatics 
> > (DBMI) Educational Coordinator, Center for Computational 
> Biology and 
> > Bioinformatics (C2B2)/ National Center for Multiscale Analysis of 
> > Genomic Networks (MAGNet) Box 95, Room 130BB or P&S 1-420C Columbia 
> > University Medical Center 630 W. 168th St.
> > New York, NY 10032
> > (212)305-6901 (5-6901) (voice)
> > friedman at cancercenter.columbia.edu
> > http://cancercenter.columbia.edu/~friedman/
> >
> > In Memoriam,
> > Algirdas Jonas Budrys
> >
> >
> >
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
>