[BioC] advice on absent present filtering needed
James W. MacDonald
jmacdon at med.umich.edu
Thu Oct 26 14:36:30 CEST 2006
Hi Mark,
Kimpel, Mark William wrote:
> I have a question about how to properly apply the MAS5 absent present
> filtering technique. Within my group, I am advocating setting a
> cutoff ratio of absent present across phenotypes (i.e. all samples),
> whereas a colleague is advocating applying the filter within
> phenotype and passing through the filter any probeset with the A/P
> ratio of >0.5 within any of the phenotypes (we have 3).
>
> The argument my colleague makes is that some probesets may only be
> expressed by one phenotype and we want to keep these in, but be
> stringent within phenotype. This makes some biologic sense, but I am
> concerned that this filtering within phenotype will introduce bias at
> low expression levels, as it would seem to, at least in some cases,
> act like a fold filter at expression levels near the limit of
> reliable detection.
Personally, I am conflicted about pre-filtering data, but when/if I do
so, I generally try to stick with 'agnostic' methods that don't account
for the sample types.
My reservations cover both the need for pre-filtering and the methods
used to do so. If one is selecting genes based on a standard t-test,
then clearly there needs to be some pre-filtering because (usually)
there isn't much replication, and one would want to guard against
selecting genes based on the very poor variance estimates that result.
However, if you use some of the available shrinkage estimators (the
eBayes() method in limma for one), then the shrinkage estimator is based
on _all_ the probesets on the chip. If you remove the probesets that
don't vary much, then you are biasing the shrinkage estimator that you
will use in the subsequent eBayes() step.
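
To make the shrinkage concern concrete, here is a toy sketch in plain Python. It is not limma's eBayes(), which estimates its prior differently; it only assumes a simplified moderated variance that pulls each probeset's variance toward the mean variance of whatever probesets are kept. The function names, the weights d and d0, the cutoff, and the simulated data are all hypothetical.

```python
import random
import statistics

def prior_variance(variances):
    """Simplified stand-in for an empirical Bayes prior: mean of the kept variances."""
    return statistics.mean(variances)

def moderated_variance(s2, s2_prior, d=5, d0=4):
    """Weighted average of a probeset's own variance and the prior.
    d = residual degrees of freedom, d0 = prior weight (hypothetical values)."""
    return (d0 * s2_prior + d * s2) / (d0 + d)

random.seed(1)
# Simulated log-scale data: 1000 probesets x 6 samples (made-up numbers).
data = [[random.gauss(8, random.uniform(0.05, 1.0)) for _ in range(6)]
        for _ in range(1000)]
variances = [statistics.variance(row) for row in data]

cutoff = 0.1  # drop probesets whose variance is at or below this value
kept = [v for v in variances if v > cutoff]

prior_all = prior_variance(variances)   # prior from every probeset
prior_kept = prior_variance(kept)       # prior after variance filtering

# The same probeset gets a different moderated variance depending on
# whether the low-variance probesets were removed first.
example = kept[0]
mod_all = moderated_variance(example, prior_all)
mod_kept = moderated_variance(example, prior_kept)
print(prior_all, prior_kept, mod_all, mod_kept)
```

In this simplified setting, dropping the low-variance probesets can only raise the prior, so every remaining probeset is shrunk toward a larger value than it would have been with the full chip.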
I am also not convinced that a comparison between PM and MM expression
levels is a reasonable measure of transcript presence. Since 30-40% of
the MM probes on a given chip have larger intensity values than the
corresponding PM probe, I worry that one might end up throwing out
probesets based on bad MM probes rather than lack of information. I do
realize that MAS5 uses the ideal mismatch (IM) rather than the MM
intensity, but the algorithm used to come up with the IM is a bit ad hoc
for my tastes.
In the past, I tended to use the kOverA() method available in
genefilter. This is agnostic in that it doesn't require any particular
subset to have a higher expression, but does require that _some_ samples
do. One could argue that this isn't that reasonable because of the
cutoff imposed, which presupposes that a probeset with an expression
below X isn't interesting. Lately, if I do filter, I have been filtering
probesets based on the variance over all samples. If the variance isn't
greater than some ad hoc value (usually 0.1 for rma numbers), then it's
outta there. This is probably a bit more defensible because I am not
directly specifying an expression cutoff, but using variance instead of, say,
standard deviation, does tend to favor probesets with a larger average
expression. However, a plot of mean expression vs variance indicates to
me that this is not overwhelming.
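
The two label-agnostic filters described above can be sketched in a few lines. This is plain Python rather than R, written only to show the logic: k_over_a mimics the test that genefilter's kOverA() performs (at least k values larger than A), and variance_filter is the variance cutoff. The probeset names, expression values, and the k and A thresholds are invented for illustration.

```python
import statistics

def k_over_a(values, k, a):
    """True if at least k of the values are larger than a."""
    return sum(v > a for v in values) >= k

def variance_filter(values, min_var=0.1):
    """True if the sample variance across all samples exceeds min_var."""
    return statistics.variance(values) > min_var

# Hypothetical log2 expression values, one entry per probeset, six samples each.
probesets = {
    "flat_low":   [4.0, 4.1, 3.9, 4.0, 4.1, 4.0],
    "flat_high":  [10.0, 10.1, 9.9, 10.0, 10.1, 10.0],
    "on_in_some": [4.0, 4.2, 3.9, 7.5, 7.8, 7.6],
}

passes_k_over_a = {name: k_over_a(v, k=2, a=6.0) for name, v in probesets.items()}
passes_variance = {name: variance_filter(v) for name, v in probesets.items()}
print(passes_k_over_a)
print(passes_variance)
```

Neither function looks at sample labels. The flat_high probeset shows where they part ways: it clears the expression cutoff in every sample, so it passes the k-over-A test, but it barely varies, so the variance filter drops it.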
Anyway, after all that rambling, I would say that you are probably
advocating the better of the two filtering procedures. Although your
colleague has a point, I think that method might bias your results. You
could split the difference and require 33.33333% of the samples to be
present ;-D
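
For comparison, the two absent/present schemes discussed in this thread can be written side by side. This is a toy Python sketch with made-up detection calls and hypothetical function names. It assumes calls are coded 'P' or 'A' and reads the ">0.5" cutoff from the original question as a fraction of present calls, which is an interpretation rather than something the thread spells out.

```python
def present_fraction(calls):
    """Fraction of detection calls that are 'P' (present)."""
    return sum(c == "P" for c in calls) / len(calls)

def pass_across_all(calls, cutoff):
    """Label-agnostic filter: pool every sample and ignore phenotype."""
    return present_fraction(calls) > cutoff

def pass_within_any(calls, phenotypes, cutoff):
    """Phenotype-aware filter: pass if any single phenotype exceeds the cutoff."""
    groups = {}
    for call, pheno in zip(calls, phenotypes):
        groups.setdefault(pheno, []).append(call)
    return any(present_fraction(g) > cutoff for g in groups.values())

# Hypothetical design: three phenotypes, three samples each.
phenotypes = ["a"] * 3 + ["b"] * 3 + ["c"] * 3
# A probeset called present in only one phenotype.
calls = ["P", "P", "P", "A", "A", "A", "A", "A", "A"]

across = pass_across_all(calls, cutoff=0.5)
within = pass_within_any(calls, phenotypes, cutoff=0.5)
print(across, within)
```

The within-phenotype version keeps this probeset and the pooled version drops it. That is the behavior the colleague wants, but it also means the filter has already used the sample labels before any test is run.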
Best,
Jim
>
> Advice?
>
> Mark
--
James W. MacDonald
University of Michigan
Affymetrix and cDNA Microarray Core
1500 E Medical Center Drive
Ann Arbor MI 48109
734-647-5623