[BioC] Invalid fold-filter

Tue Feb 21 17:49:11 CET 2006

I think it is unwise to filter based on observed data.  This biases 
the results.
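
To see why the size of the tested set matters, here is a minimal Python
sketch of the Benjamini & Hochberg adjustment applied to the same raw
p-values before and after a filter. The p-values are made up for
illustration, `bh_adjust` is a hypothetical helper (not a Bioconductor
function), and the "filter" simply keeps the four smallest p-values as a
stand-in for a filter that used the same group labels as the test.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is the
    # minimum of p * m / rank over its own rank and all higher ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values for 10 probes (illustrative only).
raw = [0.001, 0.004, 0.012, 0.03, 0.2, 0.35, 0.5, 0.62, 0.8, 0.95]

adj_all = bh_adjust(raw)       # correct over all 10 probes
adj_kept = bh_adjust(raw[:4])  # correct over the 4 probes that "passed"

n_all = sum(p <= 0.05 for p in adj_all)
n_kept = sum(p <= 0.05 for p in adj_kept)
print(n_all, n_kept)  # → 3 4
```

With these numbers, every retained probe gets a smaller adjusted p-value
once the other six are dropped, and one probe that missed a 0.05 cutoff in
the full set passes it in the reduced set, although its raw p-value never
changed.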

On the other hand, filtering on a priori considerations, such as lack 
of annotation, should not be a problem.
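
As a rough illustration of an a priori filter, the Python sketch below
drops probes that have no annotation without ever looking at the
expression values. The probe IDs, gene symbols, numbers, and the helper
name `filter_by_annotation` are all hypothetical.

```python
def filter_by_annotation(expression, annotation):
    """Keep probes with a non-empty annotation.

    The expression values are carried along but never inspected, so the
    decision to keep a probe cannot depend on the observed data.
    """
    return {
        probe: values
        for probe, values in expression.items()
        if annotation.get(probe)
    }


# Hypothetical probes with made-up expression values.
expression = {
    "probe_1": [7.1, 7.3, 9.8, 9.9],
    "probe_2": [5.0, 5.1, 5.0, 5.2],
    "probe_3": [6.2, 8.4, 6.1, 8.5],
}
# Hypothetical annotation; an empty string means "not annotated".
annotation = {"probe_1": "GENE_A", "probe_2": "", "probe_3": "GENE_C"}

kept = filter_by_annotation(expression, annotation)
print(sorted(kept))  # → ['probe_1', 'probe_3']
```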

--Naomi

At 12:19 AM 2/21/2006, Robert Gentleman wrote:

>Bornman, Daniel M wrote:
> > Robert,
> >
> > After reading your response to my initial question, I do not believe you
> > addressed exactly what I attempted to describe.  Please pardon me for
> > not being clear.  I think your response assumed I was filtering on
> > unadjusted p-values then applying a correction such as Benjamini &
> > Hochberg to a reduced set.
> >
> > My question was rather on the validity of first filtering each gene
> > based on fold-change between two sample groups (i.e. controls vs
> > treated) then calculating a test-statistic, raw p-value and corrected
> > p-value on each gene that passed the fold-change filter.  I am worried
> > that using the group phenotype description to filter followed by
> > applying a p-value correction is unfairly reducing my multiple
> > comparison penalty.
>
>   It does, and it does not matter how you get there. Using one test
>(fold change) to filter and a different test (t-test) for p-value
>correction does not change the fact that, if both tests define the
>sample groups in the same way, there are problems with the
>interpretation.
>
> >
> > I propose that a less biased approach to fold-filtering would be to
> > filter probes based on the mean of the lower half versus the mean of the
> > upper half of expression values at each probe regardless of the
> > phenotype (non-specific).  This would surely (except in some instances
> > where a phenotype causes drastic expression changes) cause the
> > fold-filtered set to be larger and thus not unfairly decrease the
> > multiple comparison penalty when computing adjusted p-values.
> >
>
>    Well, that is one thing, but really, IMHO, you are better off filtering
>on variance than on some rather arbitrary division into two groups. Why
>1/2? Many of the classification problems I deal with are very unbalanced,
>and 1/2 would be a pretty bad choice. And it is variation that is
>important (whence ANOVA, ANalysis Of VAriance).
>
>   Best wishes,
>     Robert
>
> >
> > Thank You,
> > Daniel
> >
> >
> >
> > -----Original Message-----
> > From: Robert Gentleman [mailto:rgentlem at fhcrc.org]
> > Sent: Saturday, February 18, 2006 12:57 PM
> > To: Bornman, Daniel M
> > Subject: Re: [BioC] Invalid fold-filter
> >
> > Hi Daniel,
> >   I hope not; it is, as you have noted, a flawed approach.
> >
> > best wishes
> >    Robert
> >
> > Bornman, Daniel M wrote:
> >
> >>I of course agree that filtering on a variable (phenotype) that will
> >>be used later to calculate adjusted p-values is flawed and therefore
> >>it is not a method I would implement; however, it seems that many who
> >>describe fold-filtering are doing just that.
> >>Thank you for your response.
> >>
> >>-----Original Message-----
> >>From: Robert Gentleman [mailto:rgentlem at fhcrc.org]
> >>Sent: Friday, February 17, 2006 2:15 PM
> >>To: Bornman, Daniel M
> >>Cc: bioconductor at stat.math.ethz.ch
> >>Subject: Re: [BioC] Invalid fold-filter
> >>
> >>
> >>
> >>Bornman, Daniel M wrote:
> >>
> >>
> >>>Dear BioC Folks,
> >>>
> >>>As a bioinformatician within a Statistics department I often consult
> >>>with real statisticians about the most appropriate test to apply to
> >>>our microarray experiments.  One issue that is being debated among our
> >>>statisticians is whether some types of fold-filtering may be invalid
> >>>or biased in nature.  The types of fold-filtering in question are
> >>>those that are NOT non-specific.
> >>>Some filtering of a 54K probe affy chip is useful prior to making
> >>>decisions on differential expression and there are many examples in
> >>>the Bioconductor documentation (particularly in the {genefilter}
> >>>package) on how to do so.  A popular method of non-specific filtering
> >>>for reducing your probeset prior to applying statistics is to filter
> >>>out low-expressed probes followed by filtering out probes that do not
> >>>show a minimum difference between quartiles.  These two steps are
> >>>non-specific in that they do not take into consideration the identity
> >>>of the actual samples/arrays.
> >>
> >>
> >>>On the other hand, if we had two groups of samples, say control versus
> >>>treated, and we filtered out those probes that do not have a mean
> >>>difference in expression of 2-fold between the control and treated
> >>>groups, this filtering was based on the actual samples.  This is NOT a
> >>>non-specific filter.  The problem then comes (or rather the debate
> >>>here arises) when a t-test is calculated for each probe that passed
> >>>the sample-specific fold-filtering and the p-values are adjusted for
> >>>multiple comparisons by, for example, the Benjamini & Hochberg method.
> >>>Is it valid to fold-filter using the sample identity as a criterion
> >>>followed by correcting for multiple comparisons using just those
> >>>probes that made it through the fold-filter?  When correcting for
> >>>multiple comparisons you take a penalty for the number of comparisons
> >>>you are correcting.  The larger the pool of comparisons, the larger
> >>>the penalty, thus the larger the adjusted p-value.  Or more
> >>>importantly, the smaller the set, the less your adjusted p-value is
> >>>adjusted (increased) relative to your raw p-value.  The argument is
> >>>that you used the very samples you are comparing to unfairly reduce
> >>>the adjusted p-value penalty.
> >>
> >>
> >>  It is not valid to use phenotype to compute t-statistics, filter
> >>based on the resulting p-values, and then use p-value correction
> >>methods on the result. I don't think we need research; it seems
> >>pretty obvious that this is not a valid approach.
> >>
> >>   You can do non-specific filtering, but all you are really doing
> >>there is to remove genes that are inherently uninteresting no matter
> >>what the phenotype of the corresponding sample (if there is no
> >>variation in expression for a particular gene across samples then it
> >>has no information about the phenotype of the sample). Filtering on
> >>low values is probably a bad idea although many do it (and I used to,
> >>and still do sometimes depending on the task at hand).
> >>
> >>
> >>  Best wishes
> >>    Robert
> >>
> >>
> >>
> >>>Has anyone considered this issue or heard of problems of using a
> >>>specific type of filtering rather than a non-specific one?
> >>>Thank You for any responses.
> >>>
> >>>Daniel Bornman
> >>>Research Scientist
> >>>Battelle Memorial Institute
> >>>505 King Ave
> >>>Columbus, OH 43201
> >>>
> >>>_______________________________________________
> >>>Bioconductor mailing list
> >>>Bioconductor at stat.math.ethz.ch
> >>>https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>>
> >>
> >>
> >>--
> >>Robert Gentleman, PhD
> >>Program in Computational Biology
> >>Division of Public Health Sciences
> >>Fred Hutchinson Cancer Research Center
> >>1100 Fairview Ave. N, M2-B876
> >>PO Box 19024
> >>Seattle, Washington 98109-1024
> >>206-667-7700
> >>rgentlem at fhcrc.org
> >>
> >>
> >
> >
>

Naomi S. Altman                                814-865-3791 (voice)
Associate Professor
Dept. of Statistics                              814-863-7114 (fax)
Penn State University                         814-865-1348 (Statistics)
University Park, PA 16802-2111