[BioC] Gene enrichment question

Wed Aug 15 15:51:51 CEST 2012

Dear listers,

Apologies if my question is not strictly related to Bioconductor, though 
one never knows, maybe there's a package that does what I need.

I am analysing a list of differentially expressed genes from an Illumina 
microarray. In particular I'm trying to compare the list of 
differentially expressed genes to an existing list of genes 
preferentially expressed in the stem cell population (stem cell 
signature). When I do so, about 10% of the DE genes (9 of 86) belong to 
the stem cell signature. What I'd like to do now is find out how likely 
that is to happen by chance, i.e. put a p value on it.

At the moment I know:
There are 17119 unique genes in my dataset.
Of them 86 are differentially expressed.

The stem cell signature contains 510 genes.
It is combined from several platforms, which makes it hard to establish 
the total number of unique genes it was drawn from, but it's at least 
20819 (the size of the largest platform).

There are 9 overlapping genes between DE genes and the stem cell signature.

So I wonder:

1) Is there an accepted way to calculate a p value using these data? 
For instance, could I run something like a chi-squared test? E.g. stem 
cell specific genes represent 510/20819 = 2.4% of the total dataset, so 
the expected number of stem cell genes among my DE genes would be 
86 x 2.4% = about 2. My chi-squared test would then be based on 9 
observed vs 2 expected.

2) Or do I have to generate a gene set based on the stem cell signature 
and go through the GSEA algorithms to calculate enrichment and 
significance?
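
To make option 1) concrete, here is a minimal sketch of the arithmetic. 
It is not a chi-squared test: with an expected count of only about 2, I 
have swapped in the exact hypergeometric tail (the same calculation as a 
one-sided Fisher's exact test). The function name and variable names are 
my own, and two things are assumptions rather than facts: that the gene 
universe is 20819, and that all 510 signature genes could have been 
detected in my dataset.

```python
from math import comb

def hypergeom_upper_tail(overlap, drawn, marked, universe):
    """P(X >= overlap) when `drawn` genes are picked without replacement
    from `universe` genes, of which `marked` are signature genes."""
    upper = min(drawn, marked)
    total = comb(universe, drawn)
    tail = sum(
        comb(marked, k) * comb(universe - marked, drawn - k)
        for k in range(overlap, upper + 1)
    )
    # Exact integer arithmetic; converted to float only at the end.
    return tail / total

universe = 20819   # ASSUMPTION: size of the largest platform
signature = 510    # stem cell signature genes
de_genes = 86      # differentially expressed genes
overlap = 9        # DE genes that are also in the signature

# Expected overlap by chance (the "about 2" from the text).
expected = de_genes * signature / universe

# Chance of seeing 9 or more overlapping genes.
p_value = hypergeom_upper_tail(overlap, de_genes, signature, universe)

print(f"expected overlap: {expected:.2f}")
print(f"P(overlap >= {overlap}): {p_value:.3g}")
```

The result depends on the choice of universe. Re-running with 17119 (the 
genes in my own dataset) would only be valid if the signature were first 
restricted to genes present on my array, which would also change the 510 
and possibly the 9.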

Any pointers in the right direction would be much appreciated.

Many thanks for your time and help!

Aliaksei.