[BioC] Correct p-value in GSA (Gene set enrichment) permutation tests?

Fri Sep 12 09:51:15 CEST 2008

Dear BioC Bioinformaticians,

I am using the package GSA for testing gene set enrichment in gene expression data.

GSA uses a permutation test for calculating p-values of enrichment.

Such p-values are usually defined as

p=(#(T* >= T)) / #B

where T is the originally observed test statistic, #B the number of permutations and
T* the test statistic observed for the permuted datasets.

However, the function GSA implements p=(#(T* > T)) / #B (as is also defined in the accompanying article).
See the article here:
http://www-stat.stanford.edu/~tibs/ftp/GSA.pdf

As a consequence, even for very small, underpowered designs (say, a comparison of two independent groups,
both of size 2), many of the resulting p-values are exactly p=0.
In my experience this is often the case for about half of the pathways under consideration.

For larger designs this difference might not be that crucial, but for really small designs
I think this p-value calculation delivers far too optimistic results
(too many "significant" pathways).
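
A minimal sketch of the effect for a 2-vs-2 design. The data values are made up, and the statistic is a plain difference in group means chosen only for illustration (it is not the statistic GSA uses). All 6 possible label assignments are enumerated, so the observed labelling is one of them:

```r
# Toy 2-vs-2 design; the expression values are invented for illustration.
x <- c(1.0, 1.2, 3.1, 3.4)
labels <- c(1, 1, 2, 2)

# Hypothetical statistic: difference in group means (NOT GSA's statistic).
stat <- function(lab) mean(x[lab == 2]) - mean(x[lab == 1])

t.obs <- stat(labels)

# Enumerate all choose(4, 2) = 6 ways of assigning two samples to group 2.
idx <- combn(4, 2)
t.star <- apply(idx, 2, function(g2) {
  lab <- rep(1, 4)
  lab[g2] <- 2
  stat(lab)
})

# Strict inequality: the observed labelling never counts against itself.
p.strict <- sum(t.star > t.obs) / length(t.star)

# Non-strict inequality: the observed labelling is counted once.
p.nonstrict <- sum(t.star >= t.obs) / length(t.star)
```

Here the observed statistic is the largest of the six, so the strict version gives p = 0 while the non-strict version gives p = 1/6, the smallest value this design can support.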

Is there a motivation for this unusual p-value calculation or should the lines in the GSA function

(original:)
pvalues.hi[i] = sum(r.star[i, ] > r.obs[i])/nperms
pvalues.lo[i] = sum(r.star[i, ] < r.obs[i])/nperms

read instead:
pvalues.hi[i] = sum(r.star[i, ] >= r.obs[i])/nperms
pvalues.lo[i] = sum(r.star[i, ] <= r.obs[i])/nperms
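
For comparison, a vectorised sketch of the non-strict version. The names r.star and r.obs follow the lines quoted above; the wrapper perm.pvalues and the toy values are hypothetical and not part of the GSA package. It assumes r.star is a matrix with one row per gene set and one column per permutation:

```r
# Sketch only: perm.pvalues is a hypothetical helper, not a GSA function.
# r.star: matrix (gene sets x permutations); r.obs: one value per gene set.
perm.pvalues <- function(r.star, r.obs) {
  nperms <- ncol(r.star)
  # r.obs is recycled down the columns, so row i is compared with r.obs[i].
  list(
    hi = rowSums(r.star >= r.obs) / nperms,
    lo = rowSums(r.star <= r.obs) / nperms
  )
}

# Made-up values: 2 gene sets, 3 permutations.
r.star <- matrix(c(0.5, 2.0, 1.0,
                   0.1, 0.2, 0.3),
                 nrow = 2, byrow = TRUE)
r.obs <- c(2.0, 0.05)
p <- perm.pvalues(r.star, r.obs)
```

Note that the non-strict comparison only rules out p = 0 when the observed labelling is itself among the permutations counted; in the second toy row it is not, and the lower-tail value is still 0.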

I would be grateful for any comments or clarifications!

Sincerely,

Dirk.

-- 
_____________________________________________________

Dr. Dirk Repsilber
Biomathematics / Bioinformatics group
Genetics and Biometry
Research Institute for the Biology of Farm Animals
FBN
Wilhelm-Stahl-Allee 2
D-18196 Dummerstorf
Tel: +49 38208 68 916
Fax: +49 38208 68 902
www.fbn-dummerstorf.de/de/Forschung/FBs/fb2/repsilber