[BioC] strange effect of "half" bkg substraction in Limma

Wed Aug 2 20:49:50 CEST 2006

Jose,

We are facing the same issue analyzing
array CGH data from pairs of single colour
NimbleGen chips.  Since NimbleGen data formats
are new and fluid, most R and BioC packages
are still not able to routinely read in and process
the data.  This has the beneficial effect of forcing
us to look at plots and re-examine the data and 
correction / normalisation  processing routines.  

Our NimbleGen chips do not have an MM probe for each PM probe;
instead NimbleGen adds about a thousand "RANDOM" probes with
characteristics 'similar' to the probe set under investigation.
Thus we cannot do the usual MM subtraction PM(i) - MM(i).
Subtracting the median background, or similar variants
(including the "half" algorithm), induces strong structural
changes at the left (low-intensity) edge of the MA plot,
resembling the shape of the '<' left angle bracket.
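
To make the '<' shape concrete, here is a minimal sketch on simulated
data (not our NimbleGen data).  Every number in it is an assumption
chosen for illustration: a constant background of 150, additive noise
with SD 20, and identical true signal in both channels, so the true M
is 0 everywhere.  After subtracting the background (with values
floored at 0.5 so the logs exist), the spread of M explodes at low A
while staying tight at high A.

```python
import math
import random
import statistics

def ma_values(red, green):
    """Return (M, A): M = log2(R/G), A = average log2 intensity."""
    m = math.log2(red) - math.log2(green)
    a = 0.5 * (math.log2(red) + math.log2(green))
    return m, a

def simulate(n_probes=2000, background=150.0, noise_sd=20.0, seed=1):
    """Simulate probes with equal true signal in both channels (true M = 0).

    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    raw, corrected = [], []
    for _ in range(n_probes):
        true_signal = 10 ** rng.uniform(0.5, 4)
        red = true_signal + background + rng.gauss(0, noise_sd)
        green = true_signal + background + rng.gauss(0, noise_sd)
        raw.append(ma_values(red, green))
        # Subtract the constant background, flooring at 0.5 so log2 is defined.
        corrected.append(ma_values(max(red - background, 0.5),
                                   max(green - background, 0.5)))
    return raw, corrected

def m_spread_by_intensity(points):
    """SD of M in the lowest and highest quarter of probes, ranked by A."""
    ranked = sorted(points, key=lambda p: p[1])
    quarter = len(ranked) // 4
    low = statistics.pstdev([m for m, _ in ranked[:quarter]])
    high = statistics.pstdev([m for m, _ in ranked[-quarter:]])
    return low, high

raw, corrected = simulate()
raw_low, raw_high = m_spread_by_intensity(raw)
cor_low, cor_high = m_spread_by_intensity(corrected)
print("SD of M, no correction:   low A %.3f, high A %.3f" % (raw_low, raw_high))
print("SD of M, bkg subtracted:  low A %.3f, high A %.3f" % (cor_low, cor_high))
```

Plotting M against A for the corrected values should show the fan
opening toward the left edge of the plot.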

Irizarry et al (2003) p. 257 discuss this issue.

  "Affymetrix also appears to have noticed that the linear scale is 
  not appropriate and, in the new version of their analysis algorithm 
  MAS 5.0, are now using a log scale measure.  Specifically the MAS 5.0
  signal (measure) is defined as
     signal = Tukey Biweight{log(PM(j) - CT(j))}
  with CT(j) a quantity derived from the MM that is never bigger than
  its PM pair.  See Hubbell (2001) for more details.

  Each of these measures rely upon the difference PM - MM with
  the intention of correcting for non-specific binding.  However, the
  exploratory analysis presented in Section 3 suggests that the
  MM may be detecting signal as well as non-specific binding.
  Some researchers (Naef et al, 2001) propose expression measures
  based only on the PM."
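
As a rough illustration of the signal formula in the quoted passage,
here is a sketch of a one-step Tukey biweight applied to
log(PM - CT).  The tuning constants (c = 5, eps = 0.0001), the use of
log base 2, and the example PM and CT values are my assumptions, not
something stated in the quoted text, and the function names are made
up for this sketch.  It takes CT as given and does not show how CT is
derived from MM.

```python
import math
import statistics

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a robust weighted mean.

    c and eps are assumed tuning constants, not taken from the quoted text.
    """
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    weights = []
    for v in values:
        u = (v - center) / (c * mad + eps)
        # Points far from the median (|u| >= 1) get zero weight.
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def mas5_like_signal(pm, ct):
    """Tukey biweight of log2(PM(j) - CT(j)); requires CT(j) < PM(j)."""
    return tukey_biweight([math.log2(p - t) for p, t in zip(pm, ct)])

# Hypothetical probe set: PM intensities with their CT values.
pm = [110.0, 210.0, 410.0]
ct = [10.0, 10.0, 10.0]
signal = mas5_like_signal(pm, ct)
print("log2-scale signal:", signal)

# An outlying probe is down-weighted to zero.
robust = tukey_biweight([10.0, 10.1, 9.9, 10.0, 50.0])
print("robust mean with outlier:", robust)
```

Because CT is never bigger than PM, the logarithm is always defined,
which is the point of replacing MM by CT.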

I have not yet come across journal articles investigating
specific binding to the MM probes (references would be
appreciated), but this may be part of the issue.  Perhaps it is
GC related?  We will be investigating this issue.

NimbleGen has adopted the RMA algorithm currently
used by Affymetrix for background correction and
normalization.  Bioconductor has a package, gcrma, that
also uses GC information.  A review of gcrma correction
is available at
http://bioinf.ncl.ac.uk:16080/support/courses/genespring/RMA%20comparison%20with%20MAS5.pdf

Not subtracting MM values appears to have much merit.
One obvious benefit is that data are not lost through the
artifact of being unable to take the logarithm of a
negative number.

Apparently mismatch probes are doing more than had
been originally thought by many.  Background correction is
still an evolving issue.  Your observations illustrate the
continued importance of plotting data and thinking about
the available algorithms and their effects.  Perhaps RMA
or GCRMA algorithms will produce more reasonable results?

Your latest set of arrays and your a priori knowledge
should help sort out part of this puzzle.  Let us know what
you discover about improved correction / normalisation
methods that allow your known DE genes to show themselves.

Reference:

Irizarry R., et al. (2003).  "Exploration, normalization,
and summaries of high density oligonucleotide array probe
level data".  Biostatistics 4(2), pp. 249-264.

Steven McKinney

Statistician
Molecular Oncology and Breast Cancer Program
British Columbia Cancer Research Centre

email: smckinney at bccrc.ca

tel: 604-675-8000 x7561

BCCRC
Molecular Oncology
675 West 10th Ave, Floor 4
Vancouver B.C. 
V5Z 1L3
Canada

-----Original Message-----
From: bioconductor-bounces at stat.math.ethz.ch on behalf of J.delasHeras at ed.ac.uk
Sent: Wed 8/2/2006 6:16 AM
To: <BioC Mailing List
Subject: [BioC] strange effect of "half" bkg substraction in Limma

I use Limma to analyse my 2-colour cDNA arrays.
I usually either simply subtract background (method "subtract"), or 
don't correct for background at all (for a number of reasons I will not 
go into now).

In one of my latest sets of arrays, I was fortunate enough to know 
a priori some of the genes expected to be differentially expressed 
(from previous experiments and RT-PCR confirmation).

I subtracted the background, as I did for a similar set of arrays 
(same experiments on a different cell line), and looked for the genes I 
knew to be differentially expressed. They were not in the list. 
Actually, they gave me NA when I looked for them in my normalised data 
object.
The reason for this, I found out, was that these slides had 
higher background than usual (especially on the Cy3 channel), and the 
local background for that group of genes was higher than the actual 
signal measured on ONE of the channels. This gave me a negative 
intensity value after bkg subtraction... and that's where the problem 
lies.
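
A minimal sketch of how one channel falling below its local background
turns into a missing log-ratio.  The intensities and spot names are
hypothetical, and `None` stands in for NA; this is not how Limma is
implemented, just the arithmetic behind the effect.

```python
import math

def log_ratio(red_fg, red_bg, green_fg, green_bg):
    """M = log2(R/G) after plain background subtraction, or None if undefined."""
    red = red_fg - red_bg
    green = green_fg - green_bg
    if red <= 0 or green <= 0:
        # log2 of a non-positive number does not exist: the spot becomes NA.
        return None
    return math.log2(red / green)

# Hypothetical spots: (red fg, red bg, green fg, green bg).
spots = {
    "expressed_in_both": (500.0, 100.0, 200.0, 100.0),
    "activated_by_treatment": (400.0, 100.0, 90.0, 100.0),  # green below bkg
}
results = {name: log_ratio(*values) for name, values in spots.items()}
for name, m in results.items():
    print(name, "M =", m)
```

The spot with signal on only one channel, which is exactly the kind of
gene of interest here, is the one that drops out.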

Okay... so I looked at how many spots had negative values after 
subtraction in at least one channel. Lots. I expect lots of spots to 
show no signal in either channel, so it's not surprising. But a good 
number will probably have no signal on only one channel. These are 
actually the genes I am mainly after: those that show no expression 
before my treatment, but get activated to some degree after the 
treatment.

I decided to convert the negative intensities to some arbitrary number 
that wouldn't give me trouble.
I decided to avoid a value between 0 and 1 (logs would be negative or 
zero) and chose 1.5. Just because.

I then used the RG data, corrected that way, to continue. I normalised 
within arrays (print-tip loess) and between arrays (scale). Then I 
applied the linear model as usual plus eBayes on that. Then I looked to 
see what happened to my group of known genes. They were not eliminated 
this time, that's good. BUT they were not marked as DE (using FDR <= 
0.05). In fact... EVERY spot had FDR above 0.9!!!
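
For reference, a minimal sketch of how FDR-adjusted p-values can end up
near 1 for every spot.  It assumes the FDR values are
Benjamini-Hochberg adjusted p-values (the adjustment method is not
stated above, so that is an assumption), and the p-values are made up.
When the raw p-values look uniform, i.e. no spot stands out from the
rest, the adjusted values all sit at about 1.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical case 1: a few small p-values among larger ones.
mixed = bh_adjust([0.01, 0.04, 0.03, 0.5])
print("mixed:", mixed)

# Hypothetical case 2: evenly spread p-values, nothing stands out.
flat = bh_adjust([(i + 1) / 100 for i in range(100)])
print("smallest adjusted value in flat case:", min(flat))
```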

I thought maybe I had made a mistake in the correction... so I quickly 
repeated the procedure using the "half" method to subtract 
background. This is essentially what I did before, but substituting 
negative values by 0.5, rather than 1.5.
Same thing!
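
A minimal sketch of the two substitutions described above, on one
hypothetical spot.  The helper names and intensities are invented for
illustration; the "half" rule is written as I understand it from the
Limma documentation (anything below 0.5 after subtraction is reset to
0.5), so treat that as an assumption.  It shows that for a spot with
signal on only one channel, M is set by the arbitrary replacement
value rather than by the data.

```python
import math

def corrected_intensity(fg, bg, threshold, replacement):
    """Subtract background; values at or below threshold become replacement."""
    value = fg - bg
    return replacement if value <= threshold else value

def ma_values(red, green):
    """Return (M, A) for one spot from corrected intensities."""
    m = math.log2(red / green)
    a = 0.5 * math.log2(red * green)
    return m, a

# Hypothetical spot: clear red signal, green just below local background.
red_fg, red_bg, green_fg, green_bg = 400.0, 100.0, 90.0, 100.0

# "half"-style rule: reset anything <= 0.5 to 0.5 (assumed behaviour).
m_half, a_half = ma_values(
    corrected_intensity(red_fg, red_bg, 0.5, 0.5),
    corrected_intensity(green_fg, green_bg, 0.5, 0.5))

# Custom rule from the post: replace negative values with 1.5.
m_custom, a_custom = ma_values(
    corrected_intensity(red_fg, red_bg, 0.0, 1.5),
    corrected_intensity(green_fg, green_bg, 0.0, 1.5))

print("half:   M = %.2f, A = %.2f" % (m_half, a_half))
print("custom: M = %.2f, A = %.2f" % (m_custom, a_custom))
```

The two rules differ in M by exactly log2(1.5 / 0.5) for such a spot,
whatever the measured intensities were.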

You can see the MA plots for one set of data, using either no 
background correction, the "subtract" method, the "half" method, or my 
own correction choosing "0.5" (so it's the same as "half"... I only 
included it to make sure my method did what it was supposed to do):

http://mcnach.com/MISC/MAplots1.png

It seems that using the "half" method flattens all the differences, 
after normalisation... I am guessing this is some effect of the 
normalisation procedure...
I used "half" on another set of data once, without this effect... the 
data was already "flattish" where all the M values were no bigger than 
2.5, and the background was pretty low generally.

Any ideas about what's happening here?

Incidentally, I took the data and re-analysed it without any background 
correction at all. The MA plot for the same set looks like this:

http://mcnach.com/MISC/MAplots2.png

which is nice... I expect a relatively large number of genes to be 
upregulated, and many to be activated (going from no signal or almost 
nothing, to a clearly detectable signal), and these show nicely along 
the top upwards diagonal of the diamond-shaped plot (where genes that 
have signal only on my treatment are expected to cluster).
My known genes show up also on that diagonal, and their relative 
position also fits nicely with the results obtained by RT (more 
strongly reactivated genes show higher on the diagonal). The FDR values 
also appeared reasonable.
Surprisingly (to me), not removing background, even when I had some 
slides that didn't look so good, gives pretty solid results.

Jose

-- 
Dr. Jose I. de las Heras                      Email: J.delasHeras at ed.ac.uk
The Wellcome Trust Centre for Cell Biology    Phone: +44 (0)131 6513374
Institute for Cell & Molecular Biology        Fax:   +44 (0)131 6507360
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR
UK
