[BioC] strange effect of "half" bkg substraction in Limma
Steven McKinney
smckinney at bccrc.ca
Wed Aug 2 20:49:50 CEST 2006
Jose,
We are facing the same issue analyzing
array CGH data from pairs of single colour
NimbleGen chips. Since NimbleGen data formats
are new and fluid, most R and BioC packages
are still not able to routinely read in and process
the data. This has the beneficial effect of forcing
us to look at plots and re-examine the data and
correction / normalisation processing routines.
Our NimbleGen chips do not have an MM probe for each PM probe;
instead NimbleGen adds about a thousand "RANDOM"
probes with characteristics 'similar' to the probe set under
investigation. Thus we cannot do the usual MM subtraction
PM(i) - MM(i). Subtracting the median background, or similar
variants (including the "half" algorithm),
induces strong structural changes at the left (low-intensity) edge
of the MA plot, resembling the shape of the '<' left angle bracket.
Irizarry et al (2003) p. 257 discuss this issue.
"Affymetrix also appears to have noticed that the linear scale is
not appropriate and, in the new version of their analysis algorithm
MAS 5.0, are now using a log scale measure. Specifically the MAS 5.0
signal (measure) is defined as
signal = Tukey Biweight{log(PM(j) - CT(j))}
with CT(j) a quantity derived from the MM that is never bigger than
its PM pair. See Hubbell (2001) for more details.
Each of these measures rely upon the difference PM - MM with
the intention of correcting for non-specific binding. However, the
exploratory analysis presented in Section 3 suggests that the
MM may be detecting signal as well as non-specific binding.
Some researchers (Naef et al, 2001) propose expression measures
based only on the PM."
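As a rough illustration of the MAS 5.0 style signal quoted above, here is a
minimal sketch in plain Python (not Affymetrix or Bioconductor code). It
assumes the one-step Tukey biweight with tuning constants c = 5 and
epsilon = 0.0001, a base-2 log, and CT strictly smaller than PM. The
function names are made up for this example.

```python
import math
from statistics import median

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a robust weighted mean around the median.

    c and eps are assumed tuning constants, not taken from the thread.
    """
    values = list(values)
    m = median(values)
    mad = median([abs(v - m) for v in values])  # median absolute deviation
    weights = []
    for v in values:
        u = (v - m) / (c * mad + eps)           # scaled distance from the median
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def mas5_style_signal(pm, ct):
    """Tukey biweight of log2(PM(j) - CT(j)); requires PM(j) > CT(j) for all j."""
    return tukey_biweight([math.log2(p - c) for p, c in zip(pm, ct)])

# Outlying values get zero weight, so they do not pull the estimate around.
print(tukey_biweight([-100.0, 1.0, 2.0, 3.0, 100.0]))
```

Because CT(j) is kept below PM(j), the log is always defined, which is
exactly what a plain PM - MM difference cannot guarantee.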
I have not yet come across journal articles investigating
specific binding to the MM probes (references would be
appreciated), but this may be part of the issue. Perhaps it is
GC-related? We will be investigating this issue.
NimbleGen has adopted the RMA algorithm currently
used by Affymetrix for background correction and
normalization. Bioconductor has the gcrma package, which
also uses GC information; a review of gcrma correction
is available at
http://bioinf.ncl.ac.uk:16080/support/courses/genespring/RMA%20comparison%20with%20MAS5.pdf
Not subtracting MM values appears to have much merit.
One obvious benefit is that data is not lost through the
artificial phenomenon of not being able to take the logarithm
of a negative number.
Apparently mismatch probes are doing more than had
been originally thought by many. Background correction is
still an evolving issue. Your observations illustrate the
continued importance of plotting data and thinking about
the available algorithms and their effects. Perhaps RMA
or GCRMA algorithms will produce more reasonable results?
Your latest set of arrays and your a priori knowledge
should help sort out part of this puzzle. Let us know what
you discover about improved correction / normalisation
methods that allow your known DE genes to show themselves.
Reference:
Irizarry R., et al. (2003). "Exploration, normalization,
and summaries of high density oligonucleotide array probe
level data". Biostatistics 4(2), 249-264.
Steven McKinney
Statistician
Molecular Oncology and Breast Cancer Program
British Columbia Cancer Research Centre
email: smckinney at bccrc.ca
tel: 604-675-8000 x7561
BCCRC
Molecular Oncology
675 West 10th Ave, Floor 4
Vancouver B.C.
V5Z 1L3
Canada
-----Original Message-----
From: bioconductor-bounces at stat.math.ethz.ch on behalf of J.delasHeras at ed.ac.uk
Sent: Wed 8/2/2006 6:16 AM
To: BioC Mailing List
Subject: [BioC] strange effect of "half" bkg substraction in Limma
I use Limma to analyse my 2-colour cDNA arrays.
I usually either simply subtract background (method "subtract"), or
don't correct for background at all (for a number of reasons I will not
go into now).
In one of my latest sets of arrays, I was fortunate enough to know
a priori some of the genes expected to be differentially expressed (from
previous experiments and RT-PCR confirmation).
I subtracted the background, as I did for a similar set of arrays
(same experiments on a different cell line), and looked for the genes I
knew to be differentially expressed. They were not in the list.
Actually, they gave me NA when I looked for them on my normalised data
object.
The reason for this, I found out, was that I was having slides with
higher background than usual (especially on Cy3 channel), and the local
background for that group of genes was higher than the actual signal
measured on ONE of the channels. This gave me a negative intensity
value after background subtraction... and that's where the problem lies.
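To make the failure concrete, here is a minimal sketch in plain Python
(not limma code) of how one negative channel turns a spot into NA. The
intensities are hypothetical, and None stands in for R's NA.

```python
import math

def log2_or_none(x):
    """log2 for positive intensities; None (think NA) otherwise."""
    return math.log2(x) if x > 0 else None

def ma_values(red, green):
    """M = log2(R) - log2(G), A = their mean; (None, None) if either is undefined."""
    lr, lg = log2_or_none(red), log2_or_none(green)
    if lr is None or lg is None:
        return None, None
    return lr - lg, (lr + lg) / 2

# Hypothetical spot: clear Cy5 signal, Cy3 foreground below its local background.
r_fg, r_bg = 1200.0, 200.0
g_fg, g_bg = 180.0, 230.0

print(ma_values(r_fg, g_fg))                # no background correction: finite M and A
print(ma_values(r_fg - r_bg, g_fg - g_bg))  # "subtract": Cy3 is -50, so (None, None)
```

The spot is lost even though its red channel carries a clear signal.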
Okay... so I looked at how many spots had negative values after
subtraction in at least one channel. Lots. I expect lots of spots to
show no signal in either channel, so it's not surprising. But a good
number will probably have no signal only on one channel. These are
actually the genes I am mainly after: those that show no expression
before my treatment, but get activated to some degree after the
treatment.
I decided to convert the negative intensities to some arbitrary number
that wouldn't give me trouble.
I decided to avoid a value between 0 and 1 (logs would be negative or
zero) and chose 1.5. Just because.
I then used the RG data, corrected that way, to continue. I normalised
within arrays (print-tip loess) and between arrays (scale). Then I
applied the linear model as usual plus eBayes on that. Then I looked to
see what happened to my group of known genes. They were not eliminated
this time, that's good. BUT they were not marked as DE (using FDR <=
0.05). In fact... EVERY spot had FDR above 0.9!!!
I thought maybe I had made a mistake in the correction... so I quickly
repeated the procedure using the "half" method to subtract
background. This is essentially what I did before, but substituting
negative values by 0.5, rather than 1.5.
Same thing!
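A minimal sketch of this kind of flooring, in plain Python rather than
limma code, may help show what it does to the M and A values before any
normalisation. It assumes that anything below 0.5 after subtraction is
reset to 0.5 (which also covers the negative values), and the spot
intensities are hypothetical. Whenever the green channel sits at the
floor, M = log2(R) + 1 and A = (log2(R) - 1)/2, so M = 2*(A + 1): all such
spots fall on one straight line, and spots floored in red fall on the
mirror line M = -2*(A + 1).

```python
import math

FLOOR = 0.5  # assumed replacement value for too-small intensities

def half_correct(fg, bg, floor=FLOOR):
    """Subtract background, then reset anything below `floor` to `floor`."""
    return max(fg - bg, floor)

def ma(red, green):
    """Return (M, A) for two strictly positive intensities."""
    lr, lg = math.log2(red), math.log2(green)
    return lr - lg, (lr + lg) / 2

# Hypothetical spots: (R foreground, R background, G foreground, G background)
spots = [
    (1200.0, 200.0, 180.0, 230.0),    # Cy3 below background: floored
    (400.0, 200.0, 210.0, 230.0),     # Cy3 below background: floored
    (5000.0, 300.0, 2500.0, 300.0),   # both channels well above background
]
for rf, rb, gf, gb in spots:
    m, a = ma(half_correct(rf, rb), half_correct(gf, gb))
    print(f"M = {m:6.2f}  A = {a:6.2f}  2*(A+1) = {2 * (a + 1):6.2f}")
```

For the two floored spots the first and last columns agree, while the
third spot is unaffected. A smaller floor pushes the line further out,
which is one reason the choice of 0.5 versus 1.5 changes the picture.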
You can see the MA plots for one set of data, using either no
background correction, the "subtract" method, the "half" method, or my
own correction choosing "0.5" (so it's the same as "half"... I only put
it in to make sure my method did what it was supposed to do):
http://mcnach.com/MISC/MAplots1.png
It seems that using the "half" method flattens all the differences,
after normalisation... I am guessing this is some effect of the
normalisation procedure...
I used "half" on another set of data once, without this effect... the
data was already "flattish" where all the M values were no bigger than
2.5, and the background was pretty low generally.
Any ideas about what's happening here?
Incidentally, I took the data and re-analysed it without any background
correction at all. The MA plot for the same set looks like this:
http://mcnach.com/MISC/MAplots2.png
which is nice... I expect a relatively large number of genes to be
upregulated, and many to be activated (going from no signal or almost
nothing, to a clearly detectable signal), and these show nicely along
the top upwards diagonal of the diamond-shaped plot (where genes that
have signal only on my treatment are expected to cluster).
My known genes show up also on that diagonal, and their relative
position also fits nicely with the results obtained by RT-PCR (more
strongly reactivated genes show higher on the diagonal). The FDR values
also appeared reasonable.
Surprisingly (to me), not removing background, even when I had some
slides that didn't look so good, gives pretty solid results.
Jose
--
Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk
The Wellcome Trust Centre for Cell Biology Phone: +44 (0)131 6513374
Institute for Cell & Molecular Biology Fax: +44 (0)131 6507360
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR
UK