[BioC] Invalid fold-filter
Robert Gentleman
rgentlem at fhcrc.org
Tue Feb 21 06:19:21 CET 2006
Bornman, Daniel M wrote:
> Robert,
>
> After reading your response to my initial question, I do not believe you
> addressed exactly what I attempted to describe. Please pardon me for
> not being clear. I think your response assumed I was filtering on
> unadjusted p-values then applying a correction such as Benjamini &
> Hochberg to a reduced set.
>
> My question was rather on the validity of first filtering each gene
> based on fold-change between two sample groups (i.e. controls vs
> treated) then calculating a test-statistic, raw p-value and corrected
> p-value on each gene that passed the fold-change filter. I am worried
> that using the group phenotype description to filter followed by
> applying a p-value correction is unfairly reducing my multiple
> comparison penalty.
It does, and it does not matter how you get there. Using one test
(fold change) to filter and a different test (t-test) for p-value
correction does not change the fact that, if both tests use the same
definition of the sample groups, there are problems with the
interpretation.
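
As a rough illustration of how the penalty shrinks, here is a minimal
Python sketch. It assumes an idealized case: 1000 p-values spread evenly
over (0, 1], roughly what you expect when no gene is truly changed, and
a phenotype-based filter that happens to keep exactly the 50 genes with
the smallest p-values. The name bh_adjust and all the numbers are
hypothetical, chosen only for illustration.

```python
def bh_adjust(pvals):
    """Benjamini & Hochberg adjusted p-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Assumption: a null experiment with p-values evenly spread over (0, 1].
all_p = [k / 1000 for k in range(1, 1001)]

# Assumption: the label-based filter keeps only the 50 smallest p-values.
kept_p = all_p[:50]

adj_all = bh_adjust(all_p)
adj_kept = bh_adjust(kept_p)

print(round(adj_all[0], 6))   # smallest p-value, corrected over 1000 genes
print(round(adj_kept[0], 6))  # same p-value, corrected over the 50 kept
```

In this toy case the smallest raw p-value (0.001) adjusts to about 1.0
over the full set and to about 0.05 over the filtered set, although
nothing about the data has changed.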
>
> I propose that a less biased approach to fold-filtering would be to
> filter probes based on the mean of the lower half versus the mean of the
> upper half of expression values at each probe regardless of the
> phenotype (non-specific). This would surely (except in some instances
> where a phenotype causes drastic expression changes) cause the
> fold-filtered set to be larger and thus not unfairly decrease the
> multiple comparison penalty when computing adjusted p-values.
>
Well that is one thing, but really IMHO, you are better off filtering
on variance than on some rather arbitrary division into two groups (why
1/2? Many of the classification problems I deal with are very unbalanced
and 1/2 would be a pretty bad choice). And it is variation that is
important (whence ANOVA - ANalysis Of VAriance).
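
A minimal sketch of such a non-specific variance filter, in Python. It
never looks at the phenotype labels: it ranks genes by their sample
variance across all arrays and keeps the top fraction. The name
variance_filter, the keep_fraction parameter and the toy matrix are
hypothetical, and the cutoff itself is a judgment call.

```python
import math
import statistics


def variance_filter(expr, keep_fraction=0.5):
    """Return indices of the most variable rows (genes), ignoring labels."""
    n_keep = max(1, math.ceil(keep_fraction * len(expr)))
    variances = [statistics.variance(row) for row in expr]
    ranked = sorted(range(len(expr)), key=lambda g: variances[g], reverse=True)
    return sorted(ranked[:n_keep])


# Toy expression matrix: rows are genes, columns are arrays (made-up values).
expr = [
    [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],  # flat: carries no information
    [1.0, 9.0, 1.0, 9.0, 1.0, 9.0],  # highly variable
    [5.0, 5.1, 4.9, 5.0, 5.1, 4.9],  # nearly flat
    [2.0, 2.0, 2.0, 8.0, 8.0, 8.0],  # variable
]

kept = variance_filter(expr, keep_fraction=0.5)
print(kept)
```

On this toy matrix the filter keeps genes 1 and 3. Since the labels were
never consulted, the set passed on to the t-test and the p-value
correction was not chosen with knowledge of the comparison.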
Best wishes,
Robert
>
> Thank You,
> Daniel
>
>
>
> -----Original Message-----
> From: Robert Gentleman [mailto:rgentlem at fhcrc.org]
> Sent: Saturday, February 18, 2006 12:57 PM
> To: Bornman, Daniel M
> Subject: Re: [BioC] Invalid fold-filter
>
> Hi Daniel,
> I hope not, it is as you have noted a flawed approach.
>
> best wishes
> Robert
>
> Bornman, Daniel M wrote:
>
>>I of course agree that filtering on a variable (phenotype) that will
>>be used later to calculate adjusted p-values is flawed and therefore
>>it is not a method I would implement; however, it seems that many that
>>describe fold-filtering are doing just that.
>>Thank you for your response.
>>
>>-----Original Message-----
>>From: Robert Gentleman [mailto:rgentlem at fhcrc.org]
>>Sent: Friday, February 17, 2006 2:15 PM
>>To: Bornman, Daniel M
>>Cc: bioconductor at stat.math.ethz.ch
>>Subject: Re: [BioC] Invalid fold-filter
>>
>>
>>
>>Bornman, Daniel M wrote:
>>
>>
>>>Dear BioC Folks,
>>>
>>>As a bioinformatician within a Statistics department I often consult
>>>with real statisticians about the most appropriate test to apply to
>>>our microarray experiments. One issue that is being debated among our
>>>statisticians is whether some types of fold-filtering may be invalid
>>>or biased in nature. The types of fold-filtering in question are
>>>those that tend to NOT be non-specific.
>>>Some filtering of a 54K probe affy chip is useful prior to making
>>>decisions on differential expression and there are many examples in
>>>the Bioconductor documentation (particularly in the {genefilter}
>>>package) on how to do so. A popular method of non-specific filtering
>>>for reducing your probeset prior to applying statistics is to filter
>>>out low expressed probes followed by filtering out probes that do not
>>>show a minimum difference between quartiles. These two steps are
>>>non-specific in that they do not take into consideration the actual
>>>samples/arrays.
>>>
>>>On the other hand, if we had two groups of samples, say control versus
>>>treated, and we filtered out those probes that do not have a mean
>>>difference in expression of 2-fold between the control and treated
>>>groups, this filtering was based on the actual samples. This is NOT a
>>>non-specific filter. The problem then comes (or rather the debate here
>>>arises) when a t-test is calculated for each probe that passed the
>>>sample-specific fold-filtering and the p-values are adjusted for
>>>multiple comparisons by, for example, the Benjamini & Hochberg method.
>>>Is it valid to fold-filter using the sample identity as a criterion
>>>followed by correcting for multiple comparisons using just those
>>>probes that made it through the fold-filter? When correcting for
>>>multiple comparisons you take a penalty for the number of comparisons
>>>you are correcting. The larger the pool of comparisons, the larger
>>>the penalty, thus the larger the adjusted p-value. Or more
>>>importantly, the smaller the set, the less your adjusted p-value is
>>>adjusted (increased) relative to your raw p-value. The argument is
>>>that you used the very samples you are comparing to unfairly reduce
>>>the adjusted p-value penalty.
>>
>>
>> It is not valid to use phenotype to compute t-statistics for a
>>particular phenotype and filter based on those p-values and to then
>>use p-value correction methods on the result. I don't think we need
>>research, it seems pretty obvious that this is not a valid approach.
>>
>> You can do non-specific filtering, but all you are really doing
>>there is to remove genes that are inherently uninteresting no matter
>>what the phenotype of the corresponding sample (if there is no
>>variation in expression for a particular gene across samples then it
>>has no information about the phenotype of the sample). Filtering on low
>>values is probably a bad idea although many do it (and I used to, and
>>still do sometimes depending on the task at hand).
>>
>>
>> Best wishes
>> Robert
>>
>>
>>
>>>Has anyone considered this issue or heard of problems of using a
>>>specific type of filtering rather than a non-specific one?
>>>Thank You for any responses.
>>>
>>>Daniel Bornman
>>>Research Scientist
>>>Battelle Memorial Institute
>>>505 King Ave
>>>Columbus, OH 43201
>>>
>>>_______________________________________________
>>>Bioconductor mailing list
>>>Bioconductor at stat.math.ethz.ch
>>>https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>
>>
>>
>>--
>>Robert Gentleman, PhD
>>Program in Computational Biology
>>Division of Public Health Sciences
>>Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876
>>PO Box 19024 Seattle, Washington 98109-1024 206-667-7700
>>rgentlem at fhcrc.org
>>
>>
>
>
--
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org