[BioC] Invalid fold-filter

Robert Gentleman rgentlem at fhcrc.org
Tue Feb 21 18:17:22 CET 2006


Hi,

  In substance I agree with Naomi, but I do want to suggest that filtering 
on a lack of annotation is likely to introduce biases (in the statistical 
sense), and I personally would want to deal with that at the end of the 
analysis, not at the beginning.

  Not all molecular systems are equally studied or published on, and if 
your experiment happens to involve one of the less studied ones, then 
pre-filtering on annotation will hide that information from you. In some 
cases this is not a concern, but in others it may be.

  Of course you can do little with the data if there is no annotation - 
but even there, you can get the sequence and do some reasonable stuff 
with that much information these days.

  On the approach of filtering on variation, I did some simulation 
studies to convince myself it was not a big problem (with respect to 
bias) when I first started doing it. You should do your own simulations 
if you wonder about the effect of different procedures (it is pretty 
simple).
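
  (If it helps, the kind of simulation I have in mind takes only a few lines
of R. This is a minimal sketch under the global null - no gene is truly
differentially expressed - with arbitrary dimensions and cutoffs:

    set.seed(123)
    x   <- matrix(rnorm(5000 * 10), nrow = 5000)  # 5000 genes, 10 arrays, no real effects
    grp <- rep(c("C", "T"), each = 5)             # 5 'controls' and 5 'treated'
    v   <- apply(x, 1, var)                       # overall variance, ignoring the labels
    keep <- v > median(v)                         # non-specific filter: most variable half
    p <- apply(x[keep, ], 1, function(y)
             t.test(y[grp == "C"], y[grp == "T"])$p.value)
    mean(p < 0.05)                                # stays close to the nominal 0.05
    hist(p)                                       # and the p-values remain roughly uniform

which is the sense in which this sort of non-specific filtering is not a
big problem for the tests that follow.)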

  best wishes
    Robert


Naomi Altman wrote:
> I think it is unwise to filter based on observed data.  This biases the 
> results.
> 
> On the other hand, filtering on a priori considerations, such as lack of 
> annotation, should not be a problem.
> 
> --Naomi
> 
> At 12:19 AM 2/21/2006, Robert Gentleman wrote:
> 
> 
>> Bornman, Daniel M wrote:
>> > Robert,
>> >
>> > After reading your response to my initial question, I do not believe 
>> > you
>> > addressed exactly what I attempted to describe.  Please pardon me for
>> > not being clear.  I think your response assumed I was filtering on
>> > unadjusted p-values then applying a correction such as Benjamini &
>> > Hochberg to a reduced set.
>> >
>> > My question was rather on the validity of first filtering each gene
>> > based on fold-change between two sample groups (i.e. controls vs
>> > treated) then calculating a test-statistic, raw p-value and corrected
>> > p-value on each gene that passed the fold-change filter.  I am worried
>> > that using the group phenotype description to filter followed by
>> > applying a p-value correction is unfairly reducing my multiple
>> > comparison penalty.
>>
>>   It does, and it does not matter how you get there: using one test
>> (fold change) to filter and a different test (t-test) for the corrected
>> p-values does not change the fact that if both tests make use of the
>> same sample grouping, then there are problems with the
>> interpretation.
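>>
>>   (A few lines of R on pure noise make the point; a rough sketch, with
>> arbitrary sizes and an arbitrary cutoff:
>>
>>     set.seed(1)
>>     x   <- matrix(rnorm(5000 * 10), nrow = 5000)   # 5000 genes, no real effects
>>     grp <- rep(c("C", "T"), each = 5)
>>     d   <- rowMeans(x[, grp == "T"]) - rowMeans(x[, grp == "C"])
>>     keep <- abs(d) > 1                             # 'fold' filter using the group labels
>>     p <- apply(x[keep, ], 1, function(y)
>>              t.test(y[grp == "C"], y[grp == "T"])$p.value)
>>     mean(p < 0.05)                                 # several times the nominal 0.05
>>
>> The genes that survive were selected precisely because the two groups
>> look different, so their p-values are no longer uniform under the null,
>> and any correction applied to them is computed on the wrong footing.)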
>>
>> >
>> > I propose that a less biased approach to fold-filtering would be to
>> > filter probes based on the mean of the lower half versus the mean of 
>> > the
>> > upper half of expression values at each probe regardless of the
>> > phenotype (non-specific).  This would surely (except in some instances
>> > where a phenotype causes drastic expression changes) cause the
>> > fold-filtered set to be larger and thus not unfairly decrease the
>> > multiple comparison penalty when computing adjusted p-values.
>> >
>>
>>    Well, that is one option, but really, IMHO, you are better off filtering
>> on variance than on some rather arbitrary division into two groups (why 1/2?
>> Many of the classification problems I deal with are very unbalanced, and
>> 1/2 would be a pretty bad choice). And it is variation that is
>> important (whence ANOVA - ANalysis Of VAriance).
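>>
>>    (As a rough sketch - 'eset' here is a hypothetical ExpressionSet and
>> the cutoff is entirely arbitrary - a variance filter is only a couple of
>> lines:
>>
>>     library(Biobase)
>>     v <- apply(exprs(eset), 1, var)             # per-gene variance across all arrays
>>     esetFilt <- eset[v > quantile(v, 0.5), ]    # keep the most variable half
>>
>> Nothing about the phenotype is used, which is what keeps the filter
>> non-specific.)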
>>
>>   Best wishes,
>>     Robert
>>
>> >
>> > Thank You,
>> > Daniel
>> >
>> >
>> >
>> > -----Original Message-----
>> > From: Robert Gentleman [mailto:rgentlem at fhcrc.org]
>> > Sent: Saturday, February 18, 2006 12:57 PM
>> > To: Bornman, Daniel M
>> > Subject: Re: [BioC] Invalid fold-filter
>> >
>> > Hi Daniel,
>> >   I hope not; it is, as you have noted, a flawed approach.
>> >
>> > best wishes
>> >    Robert
>> >
>> > Bornman, Daniel M wrote:
>> >
>> >>I of course agree that filtering on a variable (phenotype) that will
>> >>be used later to calculate adjusted p-values is flawed and therefore
>> >>it is not a method I would implement; however, it seems that many who
>> >>describe fold-filtering are doing just that.
>> >>Thank you for your response.
>> >>
>> >>-----Original Message-----
>> >>From: Robert Gentleman [mailto:rgentlem at fhcrc.org]
>> >>Sent: Friday, February 17, 2006 2:15 PM
>> >>To: Bornman, Daniel M
>> >>Cc: bioconductor at stat.math.ethz.ch
>> >>Subject: Re: [BioC] Invalid fold-filter
>> >>
>> >>
>> >>
>> >>Bornman, Daniel M wrote:
>> >>
>> >>
>> >>>Dear BioC Folks,
>> >>>
>> >>>As a bioinformatician within a Statistics department I often consult
>> >>>with real statisticians about the most appropriate test to apply to
>> >>>our microarray experiments.  One issue that is being debated among our
>> >>>statisticians is whether some types of fold-filtering may be invalid
>> >>>or biased in nature.  The types of fold-filtering in question are
>> >>>those that tend to NOT be non-specific.
>> >>>Some filtering of a 54K probe affy chip is useful prior to making
>> >>>decisions on differential expression and there are many examples in
>> >>>the Bioconductor documentation (particularly in the {genefilter}
>> >>>package) on how to do so.  A popular method of non-specific filtering
>> >>>for reducing your probeset prior to applying statistics is to filter
>> >>>out low-expressed probes followed by filtering out probes that do not
>> >>>show a minimum difference between quartiles.  These two steps are
>> >>>non-specific in that they do not take into consideration the actual
>> >>>samples/arrays.
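>> >>>
>> >>>For instance, with the genefilter package that kind of two-step
>> >>>non-specific filter might look something like this (just a sketch;
>> >>>'eset' is a hypothetical ExpressionSet and the cutoffs are arbitrary):
>> >>>
>> >>>  library(genefilter)                      # also loads Biobase
>> >>>  f1 <- kOverA(5, 100)                     # at least 5 arrays with expression above 100
>> >>>  f2 <- function(x) IQR(x) > 0.5           # minimum spread between the quartiles
>> >>>  keep <- genefilter(exprs(eset), filterfun(f1, f2))
>> >>>  esetFilt <- eset[keep, ]
>> >>>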
>> >>>On the other hand, if we had two groups of samples, say control versus
>> >>>treated, and we filtered out those probes that do not have a mean
>> >>>difference in expression of 2-fold between the control and treated
>> >>>groups, this filtering was based on the actual samples.  This is NOT a
>> >>>non-specific filter.  The problem then comes (or rather the debate
>> >>>here
>> >>>arises) when a t-test is calculated for each probe that passed the
>> >>>sample-specific fold-filtering and the p-values are adjusted for
>> >>>multiple comparisons by, for example, the Benjamini & Hochberg method.
>> >>>Is it valid to fold-filter using the sample identity as a criterion
>> >>>followed by correcting for multiple comparisons using just those
>> >>>probes that made it through the fold-filter?  When correcting for
>> >>>multiple comparisons you take a penalty for the number of comparisons
>> >>>you are correcting for.  The larger the pool of comparisons, the larger
>> >>>the penalty, thus the larger the adjusted p-value.  Or more
>> >>>importantly, the smaller the set, the less your adjusted p-value is
>> >>>adjusted (increased) relative to your raw p-value.  The argument is
>> >>>that, by using the actual samples you are comparing, you have unfairly
>> >>>reduced the adjusted p-value penalty.
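>> >>>
>> >>>(As a small numerical illustration of that penalty, with made-up
>> >>>p-values:
>> >>>
>> >>>  p <- sort(runif(10000))                 # 10,000 hypothetical raw p-values
>> >>>  p.adjust(p, method = "BH")[1:3]         # adjusted over all 10,000 tests
>> >>>  p.adjust(p[1:500], method = "BH")[1:3]  # same raw p-values, adjusted over only 500
>> >>>
>> >>>The smallest raw p-values come back far less inflated when the
>> >>>adjustment is computed over only 500 tests.)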
>> >>
>> >>
>> >>  It is not valid to compute t-statistics for a particular phenotype,
>> >>filter based on those p-values, and then use p-value correction methods
>> >>on the result. I don't think we need research; it seems pretty obvious
>> >>that this is not a valid approach.
>> >>
>> >>   You can do non-specific filtering, but all you are really doing
>> >>there is to remove genes that are inherently uninteresting no matter
>> >>what the phenotype of the corresponding sample (if there is no
>> >>variation in expression for a particular gene across samples then it has no
>> >>information about the phenotype of the sample). Filtering on low
>> >>values is probably a bad idea although many do it (and I used to, and
>> >>still do sometimes depending on the task at hand).
>> >>
>> >>
>> >>  Best wishes
>> >>    Robert
>> >>
>> >>
>> >>
>> >>>Has anyone considered this issue or heard of problems of using a
>> >>>specific type of filtering rather than a non-specific one?
>> >>>Thank You for any responses.
>> >>>
>> >>>Daniel Bornman
>> >>>Research Scientist
>> >>>Battelle Memorial Institute
>> >>>505 King Ave
>> >>>Columbus, OH 43201
>> >>>
> 
> 
> Naomi S. Altman                                814-865-3791 (voice)
> Associate Professor
> Dept. of Statistics                              814-863-7114 (fax)
> Penn State University                         814-865-1348 (Statistics)
> University Park, PA 16802-2111
> 
> 

-- 
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org


