[BioC] Non-specific filtering for HyperGeometric/GSEA test

Tue May 11 01:41:27 CEST 2010

Dear list,

I have a question about the non-specific filtering used to define the
gene universe for a HyperGeometric/GSEA test.

I have fifteen Affymetrix samples. To remove probe sets that show
little variation across samples, I computed the IQR of each probe set across
samples using each of the following two pieces of code:

# code one
> cutoff <- 0.5
> Iqr <- apply (exprs(eset), 1, IQR)
> selected <- (Iqr > cutoff)
> filtered <- eset[selected, ]
> dim(filtered)
Features  Samples
 11490       15

# code two
> library(genefilter)
> filtered <- varFilter(eset, var.func=IQR, var.cutoff=0.5,
+                       filterByQuantile=TRUE)
> dim(filtered)
Features  Samples
 27337       15
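
A side note on the two calls: as far as I understand the genefilter documentation, with filterByQuantile=TRUE the value var.cutoff=0.5 is read as a quantile of the per-probe-set IQR values (here the median), whereas code one compares each IQR against the absolute value 0.5. The following is only a minimal base-R sketch of that distinction on simulated data. The matrix `x` and the names `keep_abs` and `keep_q` are hypothetical and are not the data above, and the quantile rule here only approximates what varFilter does internally.

```r
# Simulated expression matrix: 1000 hypothetical probe sets x 15 samples,
# each row drawn with its own standard deviation between 0.1 and 1.
set.seed(1)
row_sd <- runif(1000, 0.1, 1)
x <- matrix(rnorm(1000 * 15, sd = row_sd), nrow = 1000)

# Per-row interquartile range, the same statistic in both approaches.
iqr <- apply(x, 1, IQR)

# Absolute cutoff (as in code one): keep rows whose IQR exceeds 0.5.
keep_abs <- iqr > 0.5

# Quantile cutoff (assumed behaviour of filterByQuantile=TRUE):
# keep rows whose IQR exceeds the 0.5 quantile, i.e. the median, of all IQRs.
keep_q <- iqr > quantile(iqr, probs = 0.5)

sum(keep_abs)  # depends on the scale of the data
sum(keep_q)    # about half of the rows, by construction
```

Under that reading, the quantile version always keeps roughly half of the probe sets regardless of the data scale, so a result of 27337 would be consistent with it if the array carries about twice that many probe sets.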

I suspect that the difference between the two "filtered" results may
come from different definitions of IQR. In the first case, IQR is
computed using the 'quantile' function rather than Tukey's format;
‘IQR(x) = quantile(x,3/4) - quantile(x,1/4)’ is the form used in the second
case. I am aware that the number of genes in the gene universe
has a significant effect on the test result. However, I am not sure
which IQR evaluation method is the better choice for the
HyperGeometric/GSEA test. I would appreciate it very much if you could
shed some light on this!
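
To illustrate why the universe size matters, here is a rough sketch using base R's `phyper` with the two universe sizes obtained above. All other counts (`selected`, `in_category`, `overlap`) are made up for illustration. The sketch also assumes, unrealistically, that the category keeps the same number of genes in both universes, whereas in practice filtering would change that count too.

```r
# Hypothetical counts, for illustration only (not taken from real data).
selected    <- 200   # genes called interesting
in_category <- 50    # genes annotated to one category (assumed equal in both universes)
overlap     <- 10    # interesting genes that fall in that category

# Upper-tail hypergeometric p-value P(X >= overlap) for a given universe size.
hyper_p <- function(universe) {
  phyper(overlap - 1, in_category, universe - in_category, selected,
         lower.tail = FALSE)
}

p_small <- hyper_p(11490)  # universe from code one
p_large <- hyper_p(27337)  # universe from code two

c(small_universe = p_small, large_universe = p_large)
```

With everything else held fixed, the larger universe lowers the expected overlap, so the same observed overlap gives a smaller p-value.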

Regards,
Yuan