[BioC] Non-specific filtering for HyperGeometric/GSEA test
Wolfgang Huber
whuber at embl.de
Wed May 12 00:50:54 CEST 2010
Dear Yuan,
have a look at the manual page of "varFilter": with filterByQuantile=TRUE,
its 'var.cutoff' argument is interpreted as the quantile of the overall
distribution of variances to be used as the cutoff, whereas in your "code
one" the "cutoff" is interpreted as the actual variance value to be used
as the cutoff.
Try with
selected <- (Iqr > quantile(Iqr, probs=cutoff))
The result of this should be nearly the same as with "code two".
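
A minimal sketch of the two interpretations, on simulated data rather than
your eset (the matrix "x" and its dimensions are made up for illustration):

```r
set.seed(1)
# Hypothetical stand-in for exprs(eset): 1000 probe sets x 15 samples,
# each row drawn with its own standard deviation
row_sd <- runif(1000, min = 0.1, max = 1)
x <- matrix(rnorm(1000 * 15, sd = rep(row_sd, 15)), nrow = 1000)

Iqr <- apply(x, 1, IQR)
cutoff <- 0.5

# "code one": cutoff used as an absolute IQR value
by_value <- Iqr > cutoff

# quantile interpretation: cutoff used as a probability, so the
# threshold is the median of all per-row IQRs
by_quantile <- Iqr > quantile(Iqr, probs = cutoff)

sum(by_value)     # depends entirely on the scale of the data
sum(by_quantile)  # about half of the rows, whatever the scale
```

With the quantile interpretation, about half of the features are kept, which
is consistent with the 27337 features reported for "code two".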
Why only "nearly"? You are right that "varFilter" does something odd
when "var.func = IQR", namely it calls "rowIQRs", which runs a little
bit faster, but produces a different result; you can verify this by
typing "varFilter" and reading its code. (One might argue that the
effort of understanding what this function does exceeds the effort of
doing it from scratch...)
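
To get a feeling for how small the effect is, here is a rough sketch in base
R. It assumes that the faster variant takes plain order statistics at
positions floor(0.25 * n) and ceiling(0.75 * n) instead of interpolating;
that is an assumption for illustration, so please check it against the
actual code. The data are simulated.

```r
set.seed(1)
# Hypothetical data: 1000 probe sets x 15 samples
x <- matrix(rnorm(1000 * 15), nrow = 1000)
n <- ncol(x)

# Interpolated IQR, as computed by stats::IQR
iqr_interp <- apply(x, 1, IQR)

# Order-statistic IQR (assumed behaviour of the faster variant)
lo <- floor(0.25 * n)
hi <- ceiling(0.75 * n)
iqr_order <- apply(x, 1, function(v) {
  s <- sort(v)
  s[hi] - s[lo]
})

# Apply the same quantile-based filter to both versions
keep_interp <- iqr_interp > quantile(iqr_interp, probs = 0.5)
keep_order  <- iqr_order  > quantile(iqr_order,  probs = 0.5)

# Fraction of probe sets on which the two filters agree
mean(keep_interp == keep_order)
```

Both filters keep the same number of probe sets; they differ only in which
borderline ones end up on either side of the threshold.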
So, both code versions should produce nearly identical results, and the
results of the downstream analysis (GSEA) should not depend sensitively
on this.
Best wishes
Wolfgang
On 11/05/10 01:41, Yuan Hao wrote:
> Dear list,
>
> May I ask a question about the non-specific filtering used for defining a
> gene universe for the HyperGeometric/GSEA test?
>
> I have fifteen samples from Affymetrix. To remove probe sets that have
> little variation across samples, I evaluated IQR of each probe set across
> samples by either of the following two pieces of code:
>
> # code one
>> cutoff <- 0.5
>> Iqr <- apply(exprs(eset), 1, IQR)
>> selected <- (Iqr > cutoff)
>> filtered <- eset[selected, ]
>> dim(filtered)
> Features Samples
> 11490 15
>
> # code two
>> library(genefilter)
>> filtered <- varFilter(eset, var.func=IQR, var.cutoff=0.5, filterByQuantile=TRUE)
>> dim(filtered)
> Features Samples
> 27337 15
>
> I realized that the difference in "filtered" between the two methods above
> may come from different definitions of IQR. In the first case, IQR was
> computed by using the 'quantile' function rather than Tukey's format:
> ‘IQR(x) = quantile(x,3/4) - quantile(x,1/4)’, which was used in the second
> case. I am aware that the number of genes in the gene universe has a
> significant effect on the test result. However, I am not sure which IQR
> evaluation method is the better choice for the HyperGeometric/GSEA test.
> It would be very much appreciated if you could shed some light on it!
>
> Regards,
> Yuan
--
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber