[BioC] GO term enrichment analysis over whole genome, copy number aberration investigation

Fri Sep 5 12:11:23 CEST 2008

Hi,

I currently have a list of HUGO gene ids which relate to genes in
areas of gain over a whole chromosome, and would like to perform GO
enrichment analysis on them. So I have 2 problems:

1. Until now I have been defining my gene universe based on Affymetrix
arrays, but I am now working over the whole genome. I tried
gene_universe = getBM(c("entrezgene"), mart = ensembl)
but this leaves me with a gene_universe of 20275 gene ids (is this
right?).
2. How do I move from my HUGO identifiers to Entrez Gene ids? I can do
this using biomaRt:
test = getBM(c("entrezgene"), filters = "hgnc_symbol", values = stGained, mart = ensembl)

However, the result is not the same length as my list of HUGO gene
identifiers (in my case 30 are missing). Why is this? Is this just
some weird annotation bug that can't be fixed, or is it the way I'm
doing it? Also, does Bioconductor have the GO information for all genes
in the genome, and not just those in the annotation files for the
Affymetrix arrays?
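
One way to track down the missing identifiers is to request the HUGO
symbol alongside the Entrez Gene id, so each returned row can be traced
back to its input. The sketch below assumes that getBM() only returns rows
for values it could match. The mapping table is written out by hand with
placeholder entries (OLDNAME1, NOENTREZ1 and the id 99999 are invented) so
that it runs without a network connection. The names not_returned,
no_entrez and multi_mapped are illustrative only.

```r
# Hypothetical input standing in for stGained (HUGO symbols in gained regions).
stGained <- c("TP53", "MYC", "EGFR", "OLDNAME1", "NOENTREZ1")

# Hand-written stand-in for the result of a call such as
#   getBM(c("hgnc_symbol", "entrezgene"), filters = "hgnc_symbol",
#         values = stGained, mart = ensembl)
# The second EGFR row and the id 99999 are invented to show a one-to-many case.
mapping <- data.frame(
  hgnc_symbol = c("TP53", "MYC", "EGFR", "EGFR", "NOENTREZ1"),
  entrezgene  = c(7157, 4609, 1956, 99999, NA),
  stringsAsFactors = FALSE
)

# Symbols that did not come back at all (e.g. outdated or alias names).
not_returned <- setdiff(stGained, mapping$hgnc_symbol)

# Symbols that came back but carry no Entrez Gene id.
no_entrez <- unique(mapping$hgnc_symbol[is.na(mapping$entrezgene)])

# Symbols that map to more than one Entrez Gene id.
mapped <- mapping[!is.na(mapping$entrezgene), ]
n_ids <- table(mapped$hgnc_symbol)
multi_mapped <- names(n_ids)[n_ids > 1]

# Unique Entrez Gene ids to carry forward into the GO analysis.
entrez_ids <- unique(mapped$entrezgene)
```

Because one symbol can map to zero, one or several Entrez Gene ids, the
length of the result need not match the length of the input in either
direction. Listing the three groups separately shows which case applies
to each missing symbol.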

Finally, what are the statistical implications of performing GO
enrichment (I'm using a conditional test) over a whole genome? Would it
be better to run the gene set enrichment analysis on each chromosome
(I don't think so)? I'm trying to find evidence that genes relating to
certain functions are gained over the whole chromosome (cancer study).
I've run a test analysis and have found some things which make sense.
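
To make the universe question concrete, here is a minimal sketch of the
plain one-sided hypergeometric test for a single GO term, using base R's
phyper(). This is the unconditional test, not the conditional test
mentioned above, and every count apart from the 20275 universe size is
made up for illustration. The helper name enrich_p is hypothetical.

```r
# P(X >= k): probability of seeing at least k term-annotated genes among
# n_selected genes drawn without replacement from a universe of n_universe
# genes, of which n_term are annotated to the GO term.
enrich_p <- function(k, n_term, n_universe, n_selected) {
  phyper(k - 1, n_term, n_universe - n_term, n_selected, lower.tail = FALSE)
}

# Whole-genome universe: 12 of 500 gained genes fall in a term of 200 genes.
# Expected count under the null is 500 * 200 / 20275, roughly 5.
p_genome <- enrich_p(k = 12, n_term = 200, n_universe = 20275, n_selected = 500)

# Hypothetical single-chromosome universe: the same 12 genes, but only
# 1000 genes in the universe and 30 of them annotated to the term.
# Expected count under the null is 500 * 30 / 1000 = 15.
p_chrom <- enrich_p(k = 12, n_term = 30, n_universe = 1000, n_selected = 500)
```

With these invented counts the same 12 genes look over-represented
against the whole genome but not against the smaller universe, because
the expected count changes with the universe. The choice of universe
therefore decides which question the test is answering.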

Many thanks in advance,

Nathan