[BioC] Difficulties in using the mgsa package for Gene Set Analysis

Wed Jan 16 17:53:24 CET 2013

Dear list,

I have been trying to apply the MGSA method for gene set analysis to my data by using the mgsa package that is part of the Bioconductor release, but so far I haven't been able to make it work.

When I use the package's readGAF function to create the list of gene sets from the GO categories, with the Rat files downloaded from the GO webpage (http://www.geneontology.org/GO.downloads.annotations.shtml), the resulting object looks like this (edited for brevity):

Object of class MgsaGoSets
16779 sets over 29266 unique items.

Set annotations:
                                        term
GO:0000002 mitochondrial genome maintenan...
...
GO:0000014 Catalysis of the hydrolysis of...
... and  16774  other sets.

Item annotations:
         symbol                              name
1302934 St8sia5 ST8 alpha-N-acetyl-neuraminide...
...
1302939   Eef1g eukaryotic translation elongat...
... and  29261  other items.

Applying the function mgsa() to my list of differentially expressed genes and these gene sets doesn't work, because it looks for matches between the 'symbol' category in the gene sets and the genes of interest. However, the numbers in the 'symbol' category are RGD IDs (from the Rat Genome Database, http://rgd.mcw.edu/), and I haven't found a way either to convert them to something else (Entrez ID, gene symbol, etc.) or to obtain the RGD IDs for my genes of interest without looking them up manually.

So, in order to apply MGSA to my data, I am hoping to get some help on how to do one of these three things:

1) Modify the MgsaGoSets object so it uses as 'symbol' a more common gene ID, such as Entrez ID, instead of RGD ID.

2) Obtain the RGD IDs of my list of differentially expressed genes from a more common gene ID.

3) Create a named list of vectors of gene identifiers, where each GO category is one item in the list and is associated with a vector of all the gene IDs that make up the category, similar to the process explained in the third section of the package creators' Bioinformatics paper (PMID: 21561920).
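
To make option 3 concrete, here is a minimal base R sketch of the kind of structure I have in mind. It is only an illustration under stated assumptions: the table go2gene, its column names, and all gene IDs in it are made up, and in practice the GO-to-gene table would have to come from a real annotation source (perhaps an annotation package such as org.Rn.eg.db, which I have not tried for this purpose).

```r
# Toy GO-to-gene table. The gene IDs are made up for illustration only;
# 'go2gene', 'go_id' and 'entrez' are hypothetical names.
go2gene <- data.frame(
  go_id  = c("GO:0000002", "GO:0000002", "GO:0000014", "GO:0000014", "GO:0000014"),
  entrez = c("1001", "1002", "1002", "1003", "1003"),
  stringsAsFactors = FALSE
)

# One list element per GO category, each holding that category's unique gene IDs.
gene_sets <- lapply(split(go2gene$entrez, go2gene$go_id), unique)

# Hypothetical differentially expressed genes, using the same ID type as the sets.
de_genes <- c("1002", "1003")

# Sanity check on ID matching: how many DE genes fall into each set?
overlap <- sapply(gene_sets, function(s) sum(de_genes %in% s))
```

If the overlap counts were all zero with real data, that would point to the same ID mismatch I describe above. A list built this way is, as far as I understand, the form of input described in the paper.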

I would welcome any suggestion you may have, as I am quite interested in comparing the results of this analysis to other gene set analysis methods. Thanks in advance for your help!

Juan

 -- output of sessionInfo(): 

> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)

locale:
[1] C/en_US.UTF-8/C/C/C/C

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets 
[7] methods   base     

other attached packages:
 [1] mgsa_1.6.0           gplots_2.11.0        MASS_7.3-22         
 [4] KernSmooth_2.23-8    caTools_1.14         gdata_2.12.0        
 [7] gtools_2.7.0         BiocInstaller_1.8.3  xtable_1.7-0        
[10] GOstats_2.24.0       graph_1.36.1         Category_2.24.0     
[13] rat2302cdf_2.11.0    genefilter_1.40.0    RColorBrewer_1.0-5  
[16] affycoretools_1.30.0 KEGG.db_2.8.0        GO.db_2.8.0         
[19] annotate_1.36.0      rat2302.db_2.8.1     org.Rn.eg.db_2.8.0  
[22] RSQLite_0.11.2       DBI_0.2-5            AnnotationDbi_1.20.3
[25] limma_3.14.3         affy_1.36.0          Biobase_2.18.0      
[28] BiocGenerics_0.4.0  

loaded via a namespace (and not attached):
 [1] AnnotationForge_1.0.3 Biostrings_2.26.2     GSEABase_1.20.1      
 [4] IRanges_1.16.4        RBGL_1.34.0           RCurl_1.95-3         
 [7] XML_3.95-0.1          affyio_1.26.0         annaffy_1.30.0       
[10] biomaRt_2.14.0        bitops_1.0-4.2        gcrma_2.30.0         
[13] lattice_0.20-10       parallel_2.15.2       preprocessCore_1.20.0
[16] splines_2.15.2        stats4_2.15.2         survival_2.36-14     
[19] tools_2.15.2          zlibbioc_1.4.0   

--
Sent via the guest posting facility at bioconductor.org.