[BioC] advice on absent present filtering needed

Thu Oct 26 14:36:30 CEST 2006

Hi Mark,

Kimpel, Mark William wrote:
> I have a question about how to properly apply the MAS5 absent present
> filtering technique. Within my group, I am advocating setting a
> cutoff ratio of absent present across phenotypes (i.e. all samples),
> whereas a colleague is advocating applying the filter within
> phenotype and passing through the filter any probeset with the A/P
> ratio of >0.5 within any of the phenotypes (we have 3).
> 
> The argument my colleague makes is that some probesets may only be
> expressed by one phenotype and we want to keep these in, but be
> stringent within phenotype. This makes some biologic sense, but I am
> concerned that this filtering within phenotype will introduce bias at
> low expression levels, as it would seem to, at least in some cases,
> act like a fold filter at expression levels near the limit of
> reliable detection.

Personally, when/if I pre-filter data, I generally try to stick with 
'agnostic' methods that don't account for the sample types.

I am conflicted about both the need for pre-filtering and the methods 
used to do so. If one is selecting genes based on a standard t-test, 
then clearly there needs to be some pre-filtering because (usually) 
there isn't much replication, and one would want to guard against 
selecting genes based on the very poor variance estimates that result. 
However, if you use some of the available shrinkage estimators (the 
eBayes() method in limma for one), then the shrinkage estimator is based 
on _all_ the probesets on the chip. If you remove the probesets that 
don't vary much, then you are biasing the shrinkage estimator that you 
will use in the subsequent eBayes() step.
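
To make the shrinkage point concrete, here is a minimal Python sketch (not limma code). It uses the moderated-variance form that limma documents, a weighted average of a prior variance and the per-probeset variance. As a loudly labeled simplification, the prior variance is taken as the plain mean of the retained variances and the prior degrees of freedom d0 is fixed by hand, whereas limma estimates both from the data. All numbers and probeset names are made up.

```python
from statistics import mean

def moderated_variance(s2, d, s0_2, d0):
    """Weighted average of the prior variance s0_2 and the probeset variance s2."""
    return (d0 * s0_2 + d * s2) / (d0 + d)

# Hypothetical per-probeset variances (made-up values), each with d = 4 df.
variances = {"ps1": 0.02, "ps2": 0.03, "ps3": 0.05, "ps4": 0.40, "ps5": 0.90}
d, d0 = 4, 4  # d0 is fixed here by assumption; limma estimates it from the data

def prior_from(probesets):
    # Simplification: the prior is the mean variance of the retained probesets.
    return mean(variances[p] for p in probesets)

prior_all = prior_from(variances)             # prior from every probeset
prior_filtered = prior_from(["ps4", "ps5"])   # prior after dropping low-variance ones

mod_all = moderated_variance(variances["ps4"], d, prior_all, d0)
mod_filtered = moderated_variance(variances["ps4"], d, prior_filtered, d0)

print("prior:", prior_all, "->", prior_filtered)
print("moderated variance of ps4:", mod_all, "->", mod_filtered)
```

In this toy setup, dropping the low-variance probesets raises the prior, so the moderated variance for the same probeset goes up even though its own data did not change.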

I am also not convinced that a comparison between PM and MM expression 
levels is a reasonable measure of transcript presence. Since 30 - 40% of 
the MM probes on a given chip have larger intensity values than the 
corresponding PM probe, I worry that one might end up throwing out 
probesets based on bad MM probes rather than lack of information. I do 
realize that MAS5 uses the ideal mismatch (IM) rather than the MM 
intensity, but the algorithm used to come up with the IM is a bit ad hoc 
for my tastes.

In the past, I tended to use the kOverA() method available in 
genefilter. This is agnostic in that it doesn't require any particular 
subset to have a higher expression, but does require that _some_ samples 
do. One could argue that this isn't that reasonable because of the 
cutoff imposed, which presupposes that a probeset with an expression 
below X isn't interesting. Lately, if I do filter, I have been filtering 
probesets based on the variance over all samples. If the variance isn't 
greater than some ad hoc value (usually 0.1 for rma numbers), then it's 
outta there. This is probably a bit more defensible because I am not 
directly specifying an expression cutoff, but using variance instead of, 
say, standard deviation does tend to favor probesets with a larger 
average expression. However, a plot of mean expression vs variance 
indicates to me that this effect is not overwhelming.
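
The two agnostic filters described above can be sketched in plain Python as follows. This is an illustration of the idea only, not the genefilter implementation: `k_over_a` mimics the "at least k samples above A" rule, and `variance_filter` applies the 0.1 variance cutoff mentioned above. The expression values, probeset names, and the choices k = 3 and A = 7.0 are hypothetical.

```python
from statistics import variance

def k_over_a(values, k, a):
    """True if at least k of the values are greater than a."""
    return sum(v > a for v in values) >= k

def variance_filter(values, cutoff):
    """True if the sample variance across all samples exceeds the cutoff."""
    return variance(values) > cutoff

# Made-up log2-scale expression values for three probesets over six samples.
expr = {
    "ps_flat_low": [4.0, 4.1, 3.9, 4.0, 4.1, 4.0],
    "ps_flat_high": [11.0, 11.1, 10.9, 11.0, 11.1, 11.0],
    "ps_on_in_some": [4.0, 4.1, 3.9, 8.0, 8.2, 7.9],
}

kept_koa = [p for p, v in expr.items() if k_over_a(v, k=3, a=7.0)]
kept_var = [p for p, v in expr.items() if variance_filter(v, cutoff=0.1)]

print("kOverA-style filter keeps:", kept_koa)
print("variance filter keeps:", kept_var)
```

Neither function looks at sample labels, which is what makes them agnostic. In this toy data the kOverA-style rule keeps the probeset that is high everywhere but never changes, while the variance rule keeps only the one that differs between samples.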

Anyway, after all that rambling, I would say that you are probably 
advocating the better of the two filtering procedures. Although your 
colleague has a point, I think that method might bias your results. You 
could split the difference and require 33.33333% of the samples to be 
present ;-D
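
For reference, the two procedures under discussion can be sketched like this in Python. It is a minimal illustration under stated assumptions: the calls are MAS5-style detection calls coded as "P", "M", or "A"; only "P" counts as present; the "A/P ratio" is read as the fraction of present calls; and the group names and calls are made up.

```python
def present_fraction(calls):
    """Fraction of 'P' calls among the detection calls."""
    return sum(c == "P" for c in calls) / len(calls)

def pass_overall(calls_by_group, cutoff):
    """Across-phenotype filter: pool all samples, then apply the cutoff."""
    pooled = [c for calls in calls_by_group.values() for c in calls]
    return present_fraction(pooled) > cutoff

def pass_within_any(calls_by_group, cutoff):
    """Within-phenotype filter: pass if any single group exceeds the cutoff."""
    return any(present_fraction(calls) > cutoff
               for calls in calls_by_group.values())

# Hypothetical probeset: present in most of group A, absent in groups B and C.
probeset = {
    "A": ["P", "P", "P", "A"],
    "B": ["A", "A", "A", "A"],
    "C": ["A", "A", "A", "A"],
}

overall = pass_overall(probeset, cutoff=0.5)     # 3 of 12 present
within = pass_within_any(probeset, cutoff=0.5)   # 3 of 4 present in group A

print("passes overall filter:", overall)
print("passes within-phenotype filter:", within)
```

The within-phenotype rule uses the sample labels to decide what survives, so probesets that happen to look different between groups near the detection limit are the ones preferentially retained.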

Best,

Jim

> 
> Advice?
> 
> Mark
> 
> Mark W. Kimpel MD
> 
> 
> Official Business Address:
> 
> Department of Psychiatry Indiana University School of Medicine PR
> M116 Institute of Psychiatric Research 791 Union Drive Indianapolis,
> IN 46202

-- 
James W. MacDonald
University of Michigan
Affymetrix and cDNA Microarray Core
1500 E Medical Center Drive
Ann Arbor MI 48109
734-647-5623
