[BioC] Gene Pre-filtering: My Two Shekels

Fri Jun 27 02:23:02 CEST 2008

Hi all,

Allow me to add my perspective as a relative newcomer to this field.

At first I too was alarmed by the apparent violation of statistical  
orthodoxy involved in pre-filtering. But after witnessing how well  
this works on real data, my opinion has changed.

I feel that either the statistician's perspective of p-values and
inference or the data-miner's perspective of signal vs. noise and
informative probes may be misleading if taken in isolation.

What has helped me is thinking of the original scientific problem. We  
have a large number of genes, belonging (roughly speaking) to three  
groups: differentially expressed, non-differentially expressed, and  
not expressed at all. Typically, our task is to identify the first  
group.

Now, neglecting to pre-filter is equivalent to conflating the second
and third groups (or, equivalently, assuming that the third group does
not exist). Indeed, the currently prevalent differential-expression
methodology ignores the existence of three groups. This obviously
leads to errors.

Pre-filtering via nsFilter or otherwise (e.g., the McClintick and
Edenberg 2006 article referred to by Mark) is equivalent to trying to
identify and remove the third group, and then using DE methodology to
separate the first two. A more sophisticated version of pre-filtering
was recently suggested by Calza et al. 2007:

S. Calza, W. Raffelsberger, A. Ploner et al. (2007). Filtering genes
to improve sensitivity in oligonucleotide microarray data analysis.
Nucleic Acids Research 35(16), e102.

I haven't tried this on any data yet, but they do have a home-grown R  
package available.
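
To make the two-step idea above concrete (remove the third group,
then separate the first two), here is a minimal sketch in plain
Python. Everything in it is an assumption for illustration: the
p-values and intensities are made-up toy numbers, the intensity
cutoff is hypothetical, and a simple intensity threshold stands in
for whatever non-specific filter one actually uses (it is not meant
to reproduce nsFilter or the Calza et al. method). The multiple-testing
correction is the standard Benjamini-Hochberg step-up procedure.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the indices rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0  # largest rank whose p-value falls under its BH threshold
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])


# Toy, made-up values for illustration only (not from any real data set).
de_p = [0.001, 0.004, 0.012, 0.018]                  # group 1: DE
non_de_p = [0.2, 0.35, 0.5, 0.65, 0.8, 0.95]         # group 2: expressed, not DE
unexpressed_p = [(i + 0.5) / 30 for i in range(30)]  # group 3: not expressed

genes = []
for p in de_p:
    genes.append({"group": "DE", "intensity": 9.0, "p": p})
for p in non_de_p:
    genes.append({"group": "non-DE", "intensity": 8.0, "p": p})
for p in unexpressed_p:
    genes.append({"group": "unexpressed", "intensity": 4.0, "p": p})

# Hypothetical cutoff on mean log2 intensity. It does not look at the
# sample labels or at the p-values, only at overall expression level.
CUTOFF = 6.0


def count_discoveries(gene_list, alpha=0.05):
    """Number of genes called DE after BH correction."""
    return len(benjamini_hochberg([g["p"] for g in gene_list], alpha))


n_unfiltered = count_discoveries(genes)

kept = [g for g in genes if g["intensity"] >= CUTOFF]
n_filtered = count_discoveries(kept)

print("tested without filter:", len(genes), "called DE:", n_unfiltered)
print("tested with filter:   ", len(kept), "called DE:", n_filtered)
```

In this toy setup the filtered analysis calls more genes DE than the
unfiltered one, simply because the third group no longer adds to the
number of hypotheses being corrected for. Real data will of course be
messier, and the filter has to be chosen so that it does not peek at
the comparison being tested.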

My own gut feeling is that much can be gained by looking at all three
groups together and trying to distinguish between them in "one fell
swoop". Once the problem is seen this way, we have all the
pattern-recognition arsenal of machine learning at our disposal.

Cheers, Assaf