[BioC] Filtering gene list prior to statistical testing

James W. MacDonald jmacdon at med.umich.edu
Tue Jun 24 17:30:41 CEST 2008


Hi Johan,

Johan van Heerden wrote:
> Dear All,
> 
> I have scoured the BioC mailing list in search of a clear answer
> regarding the filtering of data sets prior to differential testing,
> in an attempt to circumvent the multiple testing problem. Although
> several opinions have been expressed over the last couple of years, I
> have not yet found a convincing argument for or against this
> practice. I would like to make a comment and would appreciate any
> constructive feedback, as I am not a statistician but a biologist.
> 
> As far as I can see the problem has been divided into 2 categories:
> (1) "Supervised" and (2) "Unsupervised" filtering, where (1) is based
> on some knowledge regarding the functional classes present in the
> data, as opposed to (2) which does not consider any such information.
> Several criticisms have been raised against the "Supervised" approach,
> with many people calling it flawed logic. My first comments are
> regarding the logic of "Supervised" filtering.
> 
> As an example: A data set consisting of two classes (Treatment 1 and
> Treatment 2) has been generated. A fold-change cutoff is then used to
> enrich the data set for genes that show between-class activity (i.e.
> select only genes that show a mean x-fold change between classes).
> This filtered data set is then used for differential testing.
> 
> My first question is: How is this different (especially when working
> with "whole-genome" arrays) from having custom arrays constructed
> from genes known to show a response to some treatment? I.e., arrays
> will then be selectively printed with genes that are known or
> expected to show a response. This is a type of "filtering" step that
> will yield arrays with highly reduced gene sets. This scenario can
> result from prior knowledge about pathways or can arise from a
> discovery-based microarray experiment, where a researcher produces
> whole-genome arrays and from there selects "responsive" genes for the
> creation of targeted (or custom) arrays. Surely this step-wise sample
> space reduction should be subject to the same criticism?

It is. When normalizing microarray data, the assumption being made is 
that many (most) of the genes being measured are not actually changing 
expression. What most normalization schemes do is line up the bulk of 
the data so that, on average, the log fold change is zero. If we can't 
make this assumption (e.g., it is possible that _all_ genes are 
up-regulated in one sample), then without some housekeeping genes to 
use for the normalization, there is no way to normalize the data 
without making some strong and possibly unwarranted assumptions.
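
As a toy illustration in plain R (made-up numbers, with simple
median-centering standing in for fancier normalization schemes): if
_every_ gene really were up-regulated, the normalization would silently
remove that global shift:

set.seed(1)
ctrl <- matrix(rnorm(1000 * 3), ncol = 3)            ## 3 control arrays
trt  <- matrix(rnorm(1000 * 3, mean = 1), ncol = 3)  ## 3 treated arrays;
                                                     ## ALL genes up 1 unit
x  <- cbind(ctrl, trt)
## median-center each array, i.e. assume most genes don't change
xn <- sweep(x, 2, apply(x, 2, median))
## the true global up-regulation has been normalized away:
summary(rowMeans(xn[, 4:6]) - rowMeans(xn[, 1:3]))   ## centered near 0, not 1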

So the main argument, as I see it, against doing supervised sample space 
reduction is that you may be invalidating the main assumption behind most 
normalization schemes. The normalization is really the important thing 
here, as you are trying to remove unwanted technical variation that will 
have a much larger effect on your statistics than the multiple testing 
issue will.

> 
> Secondly, the supervised fold-change filter should not affect the
> statistic of each individual gene, but will have profound effects on
> the adjusted p-values. I have checked this only for t-tests and am
> not sure what the effect on more complex statistical differential
> testing methods would be. If the only effect of the "supervised"
> filtering step is the enrichment of class-specific responsive genes
> and a reduction in the severity of the p-value ADJUSTMENT (without
> affecting the actual statistic), surely this could be a very useful
> way of filtering data?
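
That last point is easy to check with a small simulation: the per-gene
statistics are untouched, and only the multiplicity adjustment changes,
because fewer tests are being corrected for. A minimal sketch in R
(simulated two-class data, not real arrays):

set.seed(1)
grp <- factor(rep(c("A", "B"), each = 4))
x   <- matrix(rnorm(10000 * 8), ncol = 8)
x[1:200, grp == "B"] <- x[1:200, grp == "B"] + 2     ## 200 truly changed genes
pv <- apply(x, 1, function(y) t.test(y ~ grp)$p.value)
fc <- abs(rowMeans(x[, grp == "B"]) - rowMeans(x[, grp == "A"]))
keep <- fc > 1                                       ## "supervised" filter
sum(p.adjust(pv, "BH") < 0.05)                       ## rejections, all genes
sum(p.adjust(pv[keep], "BH") < 0.05)                 ## rejections after filtering;
                                                     ## same p-values, milder adjustment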
> 
> With regard to the "unsupervised" approaches: These approaches define
> some overall variability threshold which can be used to filter out
> genes that don't show a minimum degree of variability, regardless of
> class. As far as I can tell there are several issues with this
> approach. (1) Some genes will be naturally "noisy", i.e. will show
> high levels of fluctuation regardless of class. These genes are
> likely to pass a filter based on degree of variability even though
> they may not be informative. (2) Some genes might show low levels of
> variability (with small changes between classes) and could be
> important, but will be excluded if a filter is based on degree of
> variability.

This is all true, but again I think the normalization issue is much more 
important, and that is where we really want to make sure we are doing a 
good job.
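
For reference, the kind of unsupervised filter being discussed is only a
couple of lines of R. This hand-rolled version (toy data, arbitrary 0.5
cutoff) keeps genes whose overall spread across all arrays exceeds a
threshold, ignoring the class labels entirely, which is roughly what the
genefilter package does for you:

set.seed(2)
x   <- matrix(rnorm(5000 * 6), ncol = 6)   ## toy log2 expression matrix
iqr <- apply(x, 1, IQR)                    ## per-gene spread, class labels ignored
xf  <- x[iqr > 0.5, ]                      ## keep only the more variable genes
nrow(xf)

Both of the objections above apply to what survives such a cut.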

These days people are getting much less interested in a list of 
differentially expressed genes, as these lists are often too large to be 
useful anyway. The real underlying goal of most experiments, IMO, is to 
find pathways that are perturbed by some treatment/condition/whatever. 
In this case one really doesn't care about multiple testing, and instead 
just uses the t-statistics (or whatever) in a GSEA-type statistic to 
measure differences at the level of gene sets.
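
The core of such a gene set calculation is also simple enough to sketch
in a few lines of R: score a set by (say) its mean t-statistic and
compare that to randomly drawn sets of the same size (limma's
geneSetTest does a more careful version of this). Everything below is
simulated:

set.seed(3)
tstat <- rnorm(10000)            ## stand-in for per-gene t-statistics
tstat[1:50] <- tstat[1:50] + 1   ## a modestly shifted 50-gene "pathway"
set  <- 1:50
obs  <- mean(tstat[set])
null <- replicate(10000, mean(sample(tstat, length(set))))
mean(abs(null) >= abs(obs))      ## permutation p-value for the set

Note that no gene in the set needs to pass a per-gene significance
threshold for the set as a whole to score well.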

Best,

Jim



> 
> I would greatly appreciate some feedback on these comments,
> specifically some statistical substantiation as to why a "supervised"
> approach is "flawed", given the similar experimental strategies
> included in the paragraph on this approach.
> 
> Many thanks!!
> 
> Johan van Heerden
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623


