[BioC] Gene Pre-filtering: My Two Shekels
aoron at fhcrc.org
Fri Jun 27 02:23:02 CEST 2008
Hi all,
Allow me to add my perspective as a relative newcomer to this field.
At first I too was alarmed by the apparent violation of statistical
orthodoxy involved in pre-filtering. But after witnessing how well
this works on real data, my opinion has changed.
I feel that either the statistician's perspective of p-values and
inference or the data-miner's perspective of signal vs. noise and
informative probes may be misleading if taken in isolation.
What has helped me is thinking of the original scientific problem. We
have a large number of genes, belonging (roughly speaking) to three
groups: differentially expressed, non-differentially expressed, and
not expressed at all. Typically, our task is to identify the first
group.
Now, neglecting to pre-filter is equivalent to conflating the second
and third groups (or, equivalently, assuming that the third group does
not exist). Indeed, the current prevalent differential-expression
methodology ignores the existence of 3 groups. This obviously leads to
errors.
Prefiltering via nsFilter or otherwise (e.g., the McClintick and
Edenberg 2006 article referred to by Mark) is equivalent to trying to
identify and remove the third group, and then using DE methodology to
separate the first two. A more sophisticated version of prefiltering
has been recently suggested by Calza et al. 2007:
S. Calza, W. Raffelsberger, A. Ploner et al. Filtering genes to
improve sensitivity in oligonucleotide microarray data analysis.
Nucleic Acids Research 35, #16, e102.
I haven't tried this on any data yet, but they do have a home-grown R
package available.
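To make the filter-then-test argument concrete, here is a toy sketch in
plain Python. Everything in it is an assumption made for illustration:
the p-values are hand-built rather than taken from any experiment, the
group sizes (20 / 80 / 900) are arbitrary, the filter is assumed to
remove the third group perfectly without touching the other two, and
the helper name bh_rejections is my own. It applies the
Benjamini-Hochberg step-up rule once to all three groups pooled
together and once to the first two groups only.

```python
def bh_rejections(pvalues, alpha=0.05):
    """Count hypotheses rejected by the Benjamini-Hochberg step-up rule."""
    m = len(pvalues)
    n_reject = 0
    for rank, p in enumerate(sorted(pvalues), start=1):
        if p <= alpha * rank / m:
            n_reject = rank  # step-up: keep the largest rank that passes
    return n_reject


# Hypothetical, hand-built p-values (not real data).
# Group 1: 20 differentially expressed genes with small p-values.
de = [2e-5 * i * i for i in range(1, 21)]
# Group 2: 80 expressed, non-DE genes; evenly spaced "uniform" p-values.
non_de = [j / 81 for j in range(1, 81)]
# Group 3: 900 genes not expressed at all; also evenly spaced p-values.
not_expressed = [j / 901 for j in range(1, 901)]

# No pre-filtering: all 1000 genes compete for the same error budget.
unfiltered = bh_rejections(de + non_de + not_expressed)
# Idealized pre-filtering: group 3 removed before testing.
filtered = bh_rejections(de + non_de)

print(unfiltered, filtered)  # prints: 2 20
```

In this contrived setting the pooled analysis recovers only 2 of the 20
DE genes, while the filtered one recovers all 20, simply because the
multiplicity burden drops from 1000 tests to 100. A real filter is
never perfect, and it needs to be chosen so that it does not bias the
p-values of the genes that survive it, so the gain on real data will be
smaller and should be checked rather than assumed.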
My own gut feel is that much can be gained by looking at all 3 groups
together and trying to distinguish between them in "one fell swoop".
Once the problem is seen this way, we have all the pattern-recognition
arsenal of machine learning at our disposal.
Cheers, Assaf