[BioC] Invalid fold-filter

Tue Feb 21 06:19:21 CET 2006

Bornman, Daniel M wrote:
> Robert,
> 
> After reading your response to my initial question, I do not believe you
> addressed exactly what I attempted to describe.  Please pardon me for
> not being clear.  I think your response assumed I was filtering on
> unadjusted p-values then applying a correction such as Benjamini &
> Hochberg to a reduced set.
> 
> My question was rather on the validity of first filtering each gene
> based on fold-change between two sample groups (i.e. controls vs
> treated) then calculating a test-statistic, raw p-value and corrected
> p-value on each gene that passed the fold-change filter.  I am worried
> that using the group phenotype description to filter followed by
> applying a p-value correction is unfairly reducing my multiple
> comparison penalty.

  It does, and it does not matter how you get there. Using one test 
(fold change) to filter and a different test (t-test) for p-value 
correction does not change the fact that, if both tests use the same 
sample grouping, there are problems with the interpretation.
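
  To make this concrete, here is a minimal sketch in Python (not part
of the original exchange and not tied to any Bioconductor function). It
simulates genes with no true difference between two groups, then
compares Benjamini & Hochberg correction over all genes with correction
over only the genes that pass a fold-change filter on the group means.
The assumptions are illustrative: log2-scale Gaussian data, a known
standard deviation so that a z-test can stand in for the t-test using
only the standard library, and the hypothetical names bh_adjust and
simulate with their default parameter values.

```python
import random
from statistics import NormalDist, mean


def bh_adjust(pvals):
    """Benjamini & Hochberg adjusted p-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def simulate(n_genes=5000, n_per_group=4, sigma=0.5,
             min_log2_diff=1.0, alpha=0.05, seed=1):
    """All genes are null: both groups share the same mean (assumed 8.0)."""
    rng = random.Random(seed)
    std_norm = NormalDist()
    # Standard error of the difference in means when sigma is known.
    se = sigma * (2.0 / n_per_group) ** 0.5
    diffs, pvals = [], []
    for _ in range(n_genes):
        control = [rng.gauss(8.0, sigma) for _ in range(n_per_group)]
        treated = [rng.gauss(8.0, sigma) for _ in range(n_per_group)]
        d = mean(treated) - mean(control)
        diffs.append(d)
        # Two-sided z-test p-value (a stand-in for the t-test).
        pvals.append(2.0 * (1.0 - std_norm.cdf(abs(d) / se)))

    # (a) Correct over every gene that was tested.
    hits_all = sum(q < alpha for q in bh_adjust(pvals))

    # (b) Filter on the phenotype-based fold change, then correct
    #     over the survivors only.
    kept = [p for d, p in zip(diffs, pvals) if abs(d) >= min_log2_diff]
    hits_filtered = sum(q < alpha for q in bh_adjust(kept))

    return {"genes": n_genes, "passed_filter": len(kept),
            "hits_all": hits_all, "hits_filtered": hits_filtered}


result = simulate()
print(result)
```

Because every simulated gene is null, anything counted in hits_filtered
is a false positive. With these settings each gene that survives the
filter already has a raw p-value below 0.005, so correcting over the
survivors alone rejects all of them, whatever the seed.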

> 
> I propose that a less biased approach to fold-filtering would be to
> filter probes based on the mean of the lower half versus the mean of the
> upper half of expression values at each probe regardless of the
> phenotype (non-specific).  This would surely (except in some instances
> where a phenotype causes drastic expression changes) cause the
> fold-filtered set to be larger and thus not unfairly decrease the
> multiple comparison penalty when computing adjusted p-values.   
> 

   Well, that is one option, but IMHO you are better off filtering on 
variance than on some rather arbitrary division into two groups. Why 
1/2? Many of the classification problems I deal with are very 
unbalanced, and 1/2 would be a pretty bad choice. And it is variation 
that is important (whence ANOVA, ANalysis Of VAriance).
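
   As a rough illustration of the difference, the following Python
sketch compares the two non-specific filters on three made-up genes
measured on ten samples. The data values and the helper names
half_split_gap and variance_filter are hypothetical, chosen only for
this example; neither filter looks at the phenotype labels.

```python
from statistics import mean, variance


def half_split_gap(values):
    """Mean of the upper half minus mean of the lower half (labels ignored).

    Needs at least two values; with an odd count the middle value is dropped.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    return mean(ordered[-half:]) - mean(ordered[:half])


def variance_filter(expr, n_keep):
    """Names of the n_keep genes with the largest variance across all samples."""
    ranked = sorted(expr, key=lambda gene: variance(expr[gene]), reverse=True)
    return ranked[:n_keep]


# Toy log2 expression values (invented for illustration).
expr = {
    # No real variation across samples.
    "flat": [8.0, 8.1, 7.9, 8.0, 8.1, 7.9, 8.0, 8.1, 7.9, 8.0],
    # Shifted by about 2 in half of the samples.
    "balanced": [6.0, 6.1, 5.9, 6.0, 6.1, 8.0, 8.1, 7.9, 8.0, 8.1],
    # Shifted by about 4, but in only 2 of the 10 samples.
    "unbalanced": [6.0, 6.1, 5.9, 6.0, 6.1, 5.9, 6.0, 6.1, 10.0, 10.1],
}

for gene, values in expr.items():
    print(f"{gene:<11} variance={variance(values):.3f} "
          f"half-split gap={half_split_gap(values):.2f}")

print(variance_filter(expr, 2))
```

In this toy data the gene that is high in only 2 of 10 samples has the
largest variance, yet its half-split gap (about 1.72) is smaller than
that of the evenly split gene (about 2.0), even though its actual shift
is roughly twice as large.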

  Best wishes,
    Robert

> 
> Thank You,
> Daniel  
> 
> 
> 
> -----Original Message-----
> From: Robert Gentleman [mailto:rgentlem at fhcrc.org] 
> Sent: Saturday, February 18, 2006 12:57 PM
> To: Bornman, Daniel M
> Subject: Re: [BioC] Invalid fold-filter
> 
> Hi Daniel,
>   I hope not; it is, as you have noted, a flawed approach.
> 
> best wishes
>    Robert
> 
> Bornman, Daniel M wrote:
> 
>>I of course agree that filtering on a variable (phenotype) that will 
>>be used later to calculate adjusted p-values is flawed and therefore 
>>it is not a method I would implement; however, it seems that many that
>>describe fold-filtering are doing just that.
>>Thank you for your response.
>>
>>-----Original Message-----
>>From: Robert Gentleman [mailto:rgentlem at fhcrc.org]
>>Sent: Friday, February 17, 2006 2:15 PM
>>To: Bornman, Daniel M
>>Cc: bioconductor at stat.math.ethz.ch
>>Subject: Re: [BioC] Invalid fold-filter
>>
>>
>>
>>Bornman, Daniel M wrote:
>>
>>
>>>Dear BioC Folks,
>>>
>>>As a bioinformatician within a Statistics department I often consult 
>>>with real statisticians about the most appropriate test to apply to 
>>>our microarray experiments.  One issue that is being debated among our
>>>statisticians is whether some types of fold-filtering may be invalid 
>>>or biased in nature.  The types of fold-filtering in question are 
>>>those that tend to NOT be non-specific.
>>>Some filtering of a 54K probe affy chip is useful prior to making 
>>>decisions on differential expression and there are many examples in 
>>>the Bioconductor documentation (particularly in the {genefilter}
>>>package) on how to do so.  A popular method of non-specific filtering 
>>>for reducing your probeset prior to applying statistics is to filter 
>>>out low expressed probes followed by filtering out probes that do not 
>>>show a minimum difference between quartiles.  These two steps are 
>>>non-specific in that they do not take into consideration the actual
>>>samples/arrays.
>>>
>>>On the other hand, if we had two groups of samples, say control versus
>>>treated, and we filtered out those probes that do not have a mean 
>>>difference in expression of 2-fold between the control and treated 
>>>groups, this filtering was based on the actual samples.  This is NOT a
>>>non-specific filter.  The problem then comes (or rather the debate 
>>>here arises) when a t-test is calculated for each probe that passed 
>>>the sample-specific fold-filtering and the p-values are adjusted for 
>>>multiple comparisons by, for example, the Benjamini & Hochberg method.
>>>Is it valid to fold-filter using the sample identity as a criterion, 
>>>followed by correcting for multiple comparisons using just those 
>>>probes that made it through the fold-filter?  When correcting for 
>>>multiple comparisons you take a penalty for the number of comparisons 
>>>you are correcting.  The larger the pool of comparisons, the larger 
>>>the penalty, thus the larger the adjusted p-value.  Or more 
>>>importantly, the smaller the set, the less your adjusted p-value is 
>>>adjusted (increased) relative to your raw p-value.  The argument is 
>>>that you used the actual samples themselves you are comparing to 
>>>unfairly reduce the adjusted p-value penalty.
>>
>>
>>  It is not valid to use phenotype to compute t-statistics for a 
>>particular phenotype and filter based on those p-values and to then 
>>use p-value correction methods on the result. I don't think we need 
>>research; it seems pretty obvious that this is not a valid approach.
>>
>>   You can do non-specific filtering, but all you are really doing 
>>there is to remove genes that are inherently uninteresting no matter 
>>what the phenotype of the corresponding sample (if there is no
>>variation in expression for a particular gene across samples then it
>>has no information about the phenotype of the sample). Filtering on low 
>>values is probably a bad idea although many do it (and I used to, and 
>>still do sometimes depending on the task at hand).
>>
>>
>>  Best wishes
>>    Robert
>>
>>
>>
>>>Has anyone considered this issue or heard of problems of using a 
>>>specific type of filtering rather than a non-specific one?
>>>Thank You for any responses.
>>>
>>>Daniel Bornman
>>>Research Scientist
>>>Battelle Memorial Institute
>>>505 King Ave
>>>Columbus, OH 43201
>>>
>>>_______________________________________________
>>>Bioconductor mailing list
>>>Bioconductor at stat.math.ethz.ch
>>>https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>
>>
>>
>>--
>>Robert Gentleman, PhD
>>Program in Computational Biology
>>Division of Public Health Sciences
>>Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 
>>PO Box 19024 Seattle, Washington 98109-1024 206-667-7700 
>>rgentlem at fhcrc.org
>>

-- 
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org