[BioC] Gene enrichment question
Aliaksei Holik
salvador at bio.bsu.by
Thu Aug 16 01:57:56 CEST 2012
Dear all,
Thank you for your answers and suggestions. Rather than replying to each
of you, I'm just going to summarise.
I've looked into hypergeometric test suggested by Alex, but admittedly
couldn't get my head round the jargon. So I focused on other methods
suggested. However, Alex's suggestion to remove 'stem cell genes' not
present in my dataset was most helpful, as indeed some 56 genes were
missing from my array.
I haven't tried going into GSEA as suggested by Steve, but I do have
expression values for the genes and will try to have a go at it later.
I have then looked into Fisher's exact test and bootstrapping suggested
by Michael and Steve. I couldn't figure out, how to use 'boot' function
to get it to sample a limited size sample (86) from a given population
(17119), so I ended up doing it "manually". In fact, I tried both
permutation and bootstrapping, by using sampling without and with
replacement, but couldn't see much difference in distribution. Any
comments on bootstrapping vs permutation and the ideal number of
replications are much appreciated.
I'm still not quite sure if I calculated final p value correctly and
certain that the whole task could have been accomplished more elegantly.
So if anyone is kind enough to read through all my coding, and correct
me, I would be very grateful. I just hope the script is readable enough.
Here it is:
##########
# Following character vectors contain names of:
# all.genes - all genes in dataset (17119)
# de.genes - differentially expressed genes in dataset (86)
# scs.genes - stem cell signature genes present in dataset (454)
# de.vs.scs - DE genes present in stem cell signature (9)
# Vectors with similar names, but with prefix 'n' (n.de.genes) contain #
numerical vectors of length of each character vector
## Run a Fisher's exact test on DE genes in SCS
# Generate a contingency table
c.table <- matrix(c(n.all.genes, n.de.genes,
+ n.scs.genes, n.de.vs.scs), nrow = 2)
# Run Fisher's test and assign observed p value to 'z.obs'
z.obs <- fisher.test(c.table)$p.value
## Write a function to generate a random sample of DE genes from all
## the genes in array and calculate Fisher's statistics on it
boot.f.test <- function(k){ # Where k is randomly drawn sample
# of 86 genes from 17119 gene pool
n <- length(intersect(k, array.vs.SCS))
# Compare sampled 86 genes to the ones in SCS
c.table <- matrix(c(n.all.genes, n.de.genes,
+ n.scs.genes, n), nrow = 2)
fisher.test(c.table)
}
## Resample DE genes 1000 times and calculate Fisher's exact test
z <- vector()
for (i in 1:1000){
# Sample 86 genes from the pool of 17119
boot.de.genes <- sample(all.genes.names,
size=length(de.genes.names),
replace = F
)
# Calculate Fisher's test on the sample
z[i] <- boot.f.test(boot.de.genes)$p.value
}
# Calculate p value as a proportion of simulated p values equal or
# smaller than observed p value
boot.p.value <- length(which(z <= z.obs))/length(z)
boot.p.value
### End of Script
Many thanks!
Aliaksei.
More information about the Bioconductor
mailing list