[BioC] Filtering before differential expression analysis of microarrays - New paper out

Gordon Smyth smyth at wehi.EDU.AU
Wed Jan 14 00:09:30 CET 2009

Dear Dan,

It's very common practice to keep all the probes for normalization, 
then to filter control probes and consistently non-expressed probes 
before differential expression analysis.  I recommend it and do it 
myself. It's such common practice that it's surprising to see a paper 
on it at this stage.

It is in the spirit of normalization methods that all probes should 
be retained for normalization, except in unusual cases in which some 
probes are obviously poor quality for reasons other than expression level.

At the differential expression step, probes can be usefully filtered 
out if they are not of any potential interest.  This means control 
probes, or probes which appear to be non-expressed across all 
conditions in the experiment, i.e., on all arrays. I have frequently 
complained on this mailing list about the practice of filtering 
individual low intensity probes on individual arrays, which IMO is a 
very destructive practice. If you filter a probe on the basis of 
expression, it must be filtered on all arrays.
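
As a rough illustration of the rule above, here is a minimal Python sketch 
(not limma code). The function name filter_probes, the threshold of 6.0 and 
the min_arrays parameter are all hypothetical choices made for the example. 
The point it shows is that the decision is made once per probe: a probe is 
either kept on every array or dropped on every array, and no individual 
low-intensity values are blanked out.

```python
from typing import Dict, List


def filter_probes(
    expr: Dict[str, List[float]],
    threshold: float,
    min_arrays: int = 1,
) -> Dict[str, List[float]]:
    """Keep a probe only if it exceeds `threshold` on at least `min_arrays` arrays.

    The decision is per probe, never per value: a kept probe retains its
    values on every array, and a dropped probe is removed from every array.
    `threshold` and `min_arrays` are illustrative assumptions.
    """
    kept = {}
    for probe, values in expr.items():
        n_expressed = sum(1 for v in values if v > threshold)
        if n_expressed >= min_arrays:
            kept[probe] = list(values)  # all arrays kept, including low ones
    return kept


# Made-up log2 intensities for three probes on four arrays.
expr = {
    "probe_A": [9.1, 8.7, 9.4, 8.9],  # expressed everywhere
    "probe_B": [4.2, 4.0, 4.3, 4.1],  # low on all arrays: dropped entirely
    "probe_C": [4.1, 4.3, 8.8, 9.0],  # low in one condition only: kept in full
}
filtered = filter_probes(expr, threshold=6.0)
print(sorted(filtered))
```

With min_arrays=1 a probe is dropped only when it looks non-expressed on 
all arrays, which matches the rule stated above. Note that probe_C keeps 
its two low values; they are part of the evidence for differential 
expression.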

Filtering non-expressed probes tends not to be emphasised on this list 
because users of this list are often sophisticated enough to use 
variance stabilizing normalization methods such as rma, vsn, normexp 
or vst.  This means that low-expression filtering is done more for 
multiplicity issues than for variance stabilization, and therefore 
often doesn't make a huge difference.  When using earlier 
normalization methods such as MAS for Affy or local background 
correction for two-color arrays, expression-filtering is absolutely 
essential, because the normalized expression values are so unstable 
at low intensity levels.
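
To make the multiplicity point concrete, here is a small Python sketch of 
the Benjamini-Hochberg adjustment applied with and without filtering. The 
p-values are made up, the helper name bh_adjust is hypothetical, and the 
sketch assumes the filter was chosen without looking at the p-values 
themselves. It only illustrates the arithmetic: fewer tests give smaller 
adjusted p-values for the probes that remain.

```python
from typing import List


def bh_adjust(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values: three expressed probes, seven non-expressed probes.
p_expressed = [0.001, 0.01, 0.02]
p_non_expressed = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

adj_all = bh_adjust(p_expressed + p_non_expressed)  # 10 tests
adj_filtered = bh_adjust(p_expressed)               # 3 tests after filtering

for raw, a, f in zip(p_expressed, adj_all, adj_filtered):
    print(f"raw={raw:.3f}  unfiltered={a:.4f}  filtered={f:.4f}")
```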

To James, it is not necessary to retain all the probes on the 
array for eBayes().  The only requirement is that eBayes() sees all 
the probes which are under consideration for differential 
expression.  So filtering out consistently non-expressed probes 
before linear modelling is generally a good idea.  In fact, filtering 
often improves the eBayes() assumptions. eBayes assumes that the 
residual variances are not intensity-dependent. However, very lowly 
expressed probes often follow a mean-variance relationship which is 
somewhat different from the other probes, even after variance 
stabilization, in which case filtering will improve the constancy of 
variance assumption.  This tends not to be a big issue with rma-Affy 
data, but it is an important issue with vst-Illumina data for example.
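
One crude way to see whether variance depends on intensity is to bin 
probes by mean intensity and compare the average variance per bin. The 
Python sketch below does this under strong simplifying assumptions: the 
data are made up, all arrays are treated as replicates of a single 
condition (so the plain per-probe variance stands in for the residual 
variance from a linear model), and the function name variance_trend and 
the choice of two bins are hypothetical. It is a diagnostic only, not what 
eBayes() does internally.

```python
from statistics import mean, variance
from typing import Dict, List, Tuple


def variance_trend(
    expr: Dict[str, List[float]], n_bins: int = 2
) -> List[Tuple[float, float]]:
    """Return (mean intensity, mean variance) for each intensity bin.

    Probes are sorted by mean intensity and split into `n_bins` groups of
    roughly equal size, lowest intensity first.
    """
    stats = sorted((mean(v), variance(v)) for v in expr.values())
    bins = []
    for b in range(n_bins):
        lo = b * len(stats) // n_bins
        hi = (b + 1) * len(stats) // n_bins
        chunk = stats[lo:hi]
        if chunk:
            bins.append(
                (mean([m for m, _ in chunk]), mean([s for _, s in chunk]))
            )
    return bins


# Made-up log2 intensities: two low probes (noisy), two high probes (stable).
expr = {
    "low_1": [4.0, 5.0, 3.0, 4.0],
    "low_2": [3.5, 4.5, 5.5, 4.5],
    "high_1": [9.0, 9.2, 8.8, 9.0],
    "high_2": [10.0, 10.1, 9.9, 10.0],
}
trend = variance_trend(expr, n_bins=2)
for avg_intensity, avg_variance in trend:
    print(f"mean intensity {avg_intensity:.2f}: mean variance {avg_variance:.4f}")
```

If the lowest bin shows a clearly different variance from the rest, as it 
does in this toy data, that is the situation in which filtering the 
consistently low probes brings the remaining probes closer to the 
constant-variance assumption.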

Best wishes

>Date: Mon, 12 Jan 2009 09:25:02 -0500
>From: "James W. MacDonald" <jmacdon at med.umich.edu>
>Subject: Re: [BioC] Filtering before differential expression analysis
>         of microarrays - New paper out
>To: Daniel Brewer <daniel.brewer at icr.ac.uk>
>Cc: bioconductor at stat.math.ethz.ch
>
>Hi Dan,
>
>Daniel Brewer wrote:
>>There is a new paper out at BMC bioinformatics that seems to justify the
>>use of filtering before differential expression analysis is performed
>>(Hackstadt & Hess BMC Bioinformatics 2009, 10:11 -
>>http://www.biomedcentral.com/1471-2105/10/11/abstract).  Specifically
>>filtering by variance and detection call.  I have got the impression
>>from this list that the general opinion is that one should only filter
>>out the control genes before testing.  I was wondering if anyone had any
>>opinions on this paper and the topic in general.
>
>I'm sure people do have opinions about this topic ;-D
>
>The reason people have so many opinions is because it isn't a simple
>question, and it depends on what you consider important.
>
>If you are just trying to limit the number of multiple comparisons to
>increase power, then filtering first is probably the way to go.
>
>If you are concerned with the accuracy of the FDR estimates, then
>filtering first may not be ideal.
>
>If you are using limma (Hackstadt and Hess used multtest), then you
>should filter after the eBayes step but before the FDR step, as an
>assumption of the eBayes step is that all of the data from the chip are
>Unless of course you are concerned about the accuracy of the FDR
>estimates, in which case... well you see the point.
>
>With microarray data analysis the arguments for and against a particular
>way of doing things can shed more heat than light, as nobody really
>knows the underlying truth, and the measures we use are really far
>removed from the actual phenomenon we are testing.
>
>>Many thanks
>
>James W. MacDonald, M.S.
>Hildebrandt Lab
>1150 W. Medical Center Drive
>Ann Arbor MI 48109-5646
