[BioC] Filtering gene list prior to statistical testing

Robert Gentleman rgentlem at fhcrc.org
Wed Jun 25 11:03:45 CEST 2008


Hi,
   Good questions, and ones best discussed with a local statistical 
expert. Mailing lists are typically not good resources for finding out 
about complex statistical issues; they do a much better job of providing 
help with using the software.

  That said, a few hints below:

Johan van Heerden wrote:
> Dear All,
> 
> I have scoured the BioC mailing list in search of a clear answer regarding the filtering of a data set prior to differential testing, in an attempt to circumvent the multiple testing problem. Although several opinions have been expressed over the last couple of years, I have not yet found a convincing argument for or against this practice.  I would like to make a comment and would appreciate any constructive feedback, as I am not a statistician but a biologist.
> 
> As far as I can see the problem has been divided into 2 categories: (1) "Supervised" and (2) "Unsupervised" filtering, where (1) is based on some knowledge regarding the functional classes present in the data, as opposed to (2), which does not consider any such information.  Several criticisms have been raised against the "Supervised" approach, with many people calling it flawed logic. My first comments are regarding the logic of "Supervised" filtering.
> 
> As an example: a data set consisting of two classes (Treatment 1 and Treatment 2) has been generated.  A fold-change filter is then used to enrich the data set for genes that show between-class activity (i.e. select only genes that show a mean x-fold change between classes). This filtered data set is then used for differential testing.
> 
> My first question is: how is this different (especially when working with "whole-genome" arrays) from having custom arrays constructed from genes known to show a response to some treatment? I.e. arrays will then be selectively printed with genes that are known or expected to show a response. This is a type of "filtering" step that will yield arrays with highly reduced gene sets. This scenario can result from prior knowledge about pathways, or can arise from a discovery-based microarray experiment, where a researcher produces whole-genome arrays and from there selects "responsive" genes for the creation of targeted (or custom) arrays. Surely this step-wise sample-space reduction should be subject to the same criticism?
> 
  If you use one data set to select the genes, and a second one to 
analyze only those genes, then all is fine, and one expects to see 
appropriate statistical behavior of most quantities.  This is basically 
what would happen if you designed a special array for your setting.  If 
you use the same data set to do both, then pretty much all of the 
necessary assumptions have been violated, and no meaningful inference 
can be made from the p-values.  This is Stats 101 (or at least it used 
to be).
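
  The problem is easy to see by simulation: generate pure noise, apply 
a fold-change filter that uses the class labels, and then t-test the 
survivors on the same data.  A minimal sketch in R (the gene and sample 
counts are made up for illustration):

set.seed(123)
n.genes <- 10000
cls <- rep(c(1, 2), each = 5)          # two classes, 5 samples each
x <- matrix(rnorm(n.genes * length(cls)), nrow = n.genes)  # pure noise

## "supervised" filter: keep genes with a large between-class difference
fc <- rowMeans(x[, cls == 1]) - rowMeans(x[, cls == 2])
keep <- abs(fc) > 1

## t-test the survivors on the SAME data used to select them
pvals <- apply(x[keep, ], 1, function(y)
    t.test(y[cls == 1], y[cls == 2])$p.value)

## under the null these should be roughly uniform on [0, 1];
## instead they pile up near zero
hist(pvals, breaks = 50)

Had the filter been computed on an independent data set, the p-values 
of the retained genes would again be (roughly) uniform under the null, 
and adjusting them would mean something.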

> Secondly, the supervised fold-change filter should not affect the statistic of each individual gene, but will have profound effects on the adjusted p-values. I have checked this only for t-tests, and am not sure what the effect on more complex differential testing methods would be. If the only effect of the "supervised" filtering step is the enrichment of class-specific responsive genes and a reduction in the severity of the p-value ADJUSTMENT (without affecting the actual statistic), could this not be a very useful way of filtering data?
> 
  This makes no sense to me as stated - consult a local expert, with a 
more explicit statement of what you don't understand.
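
  One piece of the question can be checked mechanically, though: the 
Benjamini-Hochberg adjustment scales with the number of tests, so 
filtering does make adjusted p-values less severe even though the raw 
statistics are untouched.  The trouble is that the adjustment is only 
valid when the p-values fed to it are uniform under the null, which is 
exactly the property a same-data fold-change filter destroys (see the 
simulation above).  A toy illustration, with invented numbers:

set.seed(42)
p <- c(1e-4, 1e-3, 2e-2, runif(9997))   # three small p-values in noise
p.adjust(p, method = "BH")[1:3]         # adjusted against 10000 tests
p.adjust(p[1:100], method = "BH")[1:3]  # same raw values, only 100 tests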

> Wrt the "unsupervised" approaches: these define some overall variability threshold which can be used to filter out genes that don't show a minimum degree of variability, regardless of class. As far as I can tell there are several issues with this approach. (1) Some genes will be naturally "noisy", i.e. will show high levels of fluctuation regardless of class. These genes are likely to be retained by a filter based on degree of variability. (2) Some genes might show low levels of variability (with small changes between classes) and could be important, but will be excluded if a filter is based on degree of variability.
> 
  Yes to (1) and to (2).  For (1), you know that these genes may be 
informative about some phenotype (and typically they are, but perhaps 
not the one you are studying - whence the name non-specific filtering).  
Genes that vary little across all samples are typically not informative 
for any phenotype (and hence not for the one(s) you might be interested 
in).

  For (2), microarray technology has its limits - that is one of them. 
If genes that exhibit that type of behavior are likely to be important 
to you, then you need a different tool.  Put slightly differently, 
keeping genes that exhibit that sort of behavior enriches your pool for 
non-informative genes/probes; most of us are trying to enrich for 
informative ones (your use case may be different).
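
  For concreteness: a non-specific filter never looks at the class 
labels, so it does not bias the subsequent tests the way a same-data 
fold-change filter does.  A minimal base-R sketch (the IQR statistic 
and the median cutoff are common conventions, not recommendations; the 
Bioconductor genefilter package provides more structured tools):

set.seed(1)
## stand-in for a real log-scale expression matrix, genes x samples
exprs <- matrix(rnorm(10000 * 10), nrow = 10000)

## overall spread across ALL samples; the class labels are never used
spread <- apply(exprs, 1, IQR)
keep <- spread > median(spread)   # e.g. keep the more variable half
exprs.filt <- exprs[keep, , drop = FALSE]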

> I would greatly appreciate some feedback on these comments, specifically some statistical substantiation as to why a "supervised" approach is "flawed", given the similar experimental strategies described in the paragraph on that approach.
> 
   Local experts are more likely to give you the help you want, and 
certainly posting with a signature is likely to be more successful here too.

  Robert

> Many Thanks!!
> Johan van Heerden

-- 
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org


