[BioC] Gene enrichment question

Thu Aug 16 01:57:56 CEST 2012

Dear all,

Thank you for your answers and suggestions. Rather than replying to each 
of you, I'm just going to summarise.

I've looked into hypergeometric test suggested by Alex, but admittedly 
couldn't get my head round the jargon. So I focused on other methods 
suggested. However, Alex's suggestion to remove 'stem cell genes' not 
present in my dataset was most helpful, as indeed some 56 genes were 
missing from my array.

I haven't tried going into GSEA as suggested by Steve, but I do have 
expression values for the genes and will try to have a go at it later.

I have then looked into Fisher's exact test and bootstrapping suggested 
by Michael and Steve. I couldn't figure out, how to use 'boot' function 
to get it to sample a limited size sample (86) from a given population 
(17119), so I ended up doing it "manually". In fact, I tried both 
permutation and bootstrapping, by using sampling without and with 
replacement, but couldn't see much difference in distribution. Any 
comments on bootstrapping vs permutation and the ideal number of 
replications are much appreciated.

I'm still not quite sure if I calculated final p value correctly and 
certain that the whole task could have been accomplished more elegantly. 
So if anyone is kind enough to read through all my coding, and correct 
me, I would be very grateful. I just hope the script is readable enough. 
Here it is:

##########
# Following character vectors contain names of:
# all.genes - all genes in dataset (17119)
# de.genes - differentially expressed genes in dataset (86)
# scs.genes - stem cell signature genes present in dataset (454)
# de.vs.scs - DE genes present in stem cell signature (9)

# Vectors with similar names, but with prefix 'n' (n.de.genes) contain # 
numerical vectors of length of each character vector

## Run a Fisher's exact test on DE genes in SCS
	# Generate a contingency table
		c.table <- matrix(c(n.all.genes, n.de.genes,
				  + n.scs.genes, n.de.vs.scs), nrow = 2)

	# Run Fisher's test and assign observed p value to 'z.obs'
		z.obs <- fisher.test(c.table)$p.value

## Write a function to generate a random sample of DE genes from all
## the genes in array and calculate Fisher's statistics on it
	boot.f.test <- function(k){ # Where k is randomly drawn sample
				    # of 86 genes from 17119 gene pool
	n <- length(intersect(k, array.vs.SCS))
			# Compare sampled 86 genes to the ones in SCS
   c.table <- matrix(c(n.all.genes, n.de.genes,
		    + n.scs.genes, n), nrow = 2)
   fisher.test(c.table)
}

## Resample DE genes 1000 times and calculate Fisher's exact test
z <- vector()
for (i in 1:1000){
   # Sample 86 genes from the pool of 17119
   boot.de.genes <- sample(all.genes.names,
                           size=length(de.genes.names),
                           replace = F
                           )
   # Calculate Fisher's test on the sample
   z[i] <- boot.f.test(boot.de.genes)$p.value
}

# Calculate p value as a proportion of simulated p values equal or
# smaller than observed p value
boot.p.value <- length(which(z <= z.obs))/length(z)
boot.p.value

### End of Script

Many thanks!

Aliaksei.