[BioC] Help using ENSMUSG ids in GOstats

James W. MacDonald jmacdon at med.umich.edu
Mon May 12 16:44:39 CEST 2008


Hi John,

Perhaps this will help a bit.

 > library(org.Mm.eg.db)
Loading required package: AnnotationDbi
Loading required package: Biobase
Loading required package: tools

Welcome to Bioconductor

   Vignettes contain introductory material. To view, type
   'openVignette()'. To cite Bioconductor, see
   'citation("Biobase")' and for packages 'citation(pkgname)'.

Loading required package: DBI
Loading required package: RSQLite
 > ls(2)
  [1] "org.Mm.eg"           "org.Mm.eg_dbconn"    "org.Mm.eg_dbfile"
  [4] "org.Mm.eg_dbInfo"    "org.Mm.eg_dbschema"  "org.Mm.egACCNUM"
  [7] "org.Mm.egACCNUM2EG"  "org.Mm.egALIAS2EG"   "org.Mm.egCHR"
[10] "org.Mm.egCHRLENGTHS" "org.Mm.egCHRLOC"     "org.Mm.egENSEMBL"
[13] "org.Mm.egENSEMBL2EG" "org.Mm.egENZYME"     "org.Mm.egENZYME2EG"
[16] "org.Mm.egGENENAME"   "org.Mm.egGO"         "org.Mm.egGO2ALLEGS"
[19] "org.Mm.egGO2EG"      "org.Mm.egMAP"        "org.Mm.egMAP2EG"
[22] "org.Mm.egMAPCOUNTS"  "org.Mm.egMGI"        "org.Mm.egMGI2EG"
[25] "org.Mm.egORGANISM"   "org.Mm.egPATH"       "org.Mm.egPATH2EG"
[28] "org.Mm.egPFAM"       "org.Mm.egPMID"       "org.Mm.egPMID2EG"
[31] "org.Mm.egPROSITE"    "org.Mm.egREFSEQ"     "org.Mm.egREFSEQ2EG"
[34] "org.Mm.egSYMBOL"     "org.Mm.egSYMBOL2EG"  "org.Mm.egUNIGENE"
[37] "org.Mm.egUNIGENE2EG"

 > ?org.Mm.egENSEMBL

You will probably also need the revmap() function, which reverses the 
direction of a map. If we assume that you have a character vector of 
Ensembl IDs called ENSMUSG:

gns <- mget(ENSMUSG, revmap(org.Mm.egENSEMBL), ifnotfound = NA)

will give you a list of Entrez Gene IDs, with NA for any Ensembl ID that 
has no match (without ifnotfound = NA, mget() stops with an error when 
it hits an unmatched ID). The org.Mm.egENSEMBL2EG object in the listing 
above looks like the same reverse map, ready-made. For GOstats you need 
a character vector of unique Entrez Gene IDs, so you will need to check 
for Ensembl IDs that map to more than one Entrez Gene ID (there is no 
guarantee of a one-to-one mapping) and then remove duplicates. Simply 
wrapping the above in unlist() is not likely to be what you want.
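
A minimal sketch of one way to handle the one-to-many cases, in base R only. The list below is a made-up stand-in for what mget() returns (the names and IDs are invented), and dropping ambiguous Ensembl IDs is just one possible policy, not the only reasonable one:

```r
# Hypothetical stand-in for the list returned by mget(); all IDs are invented.
gns <- list(
  ENSMUSG_A = "11111",
  ENSMUSG_B = c("22222", "33333"),  # one Ensembl ID, two Entrez Gene IDs
  ENSMUSG_C = NA_character_,        # no Entrez Gene ID found
  ENSMUSG_D = "11111"               # same Entrez Gene ID as ENSMUSG_A
)

# How many Entrez Gene IDs does each Ensembl ID map to (ignoring NA)?
n_hits <- sapply(gns, function(x) sum(!is.na(x)))

# One possible policy: keep only Ensembl IDs with exactly one Entrez Gene ID.
one_to_one <- gns[n_hits == 1]

# Flatten and drop duplicates to get a character vector of unique IDs.
entrez <- unique(unlist(one_to_one, use.names = FALSE))
entrez
```

Looking at table(n_hits) first shows how many Ensembl IDs are unmapped or ambiguous, which helps in deciding whether dropping them is acceptable for your data.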

The same holds true for the universe, which is the set of genes that 
could have been selected from your chip. Once you have those things, the 
procedure is quite straightforward. An example with fake data:

First just get some arbitrary Entrez Gene IDs (column 1 of the table; 
these are simply the first rows, not a random sample):

 > gns <- unique(toTable(org.Mm.egENSEMBL)[1:100,1])
 > univ <- unique(toTable(org.Mm.egENSEMBL)[1:1000,1])
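
With real data, where the two vectors come from separate mapping steps, it can be worth a quick sanity check before building the parameter object. The sketch below uses invented IDs in place of real ones and only base R; the check follows from the definition of the universe as the set of genes that could have been selected:

```r
# Stand-ins for the selected genes and the universe; the IDs are invented.
gns  <- c("11111", "22222", "22222", "99999")
univ <- c("11111", "22222", "33333", "44444")

# Both should be character vectors of unique Entrez Gene IDs.
gns  <- unique(as.character(gns))
univ <- unique(as.character(univ))

# A selected gene outside the universe could not have been selected,
# so it points to a mapping problem somewhere upstream.
outside <- setdiff(gns, univ)
if (length(outside) > 0) {
  message(length(outside), " selected ID(s) not in the universe: ",
          paste(outside, collapse = ", "))
}

# Restrict the selection to the universe.
gns <- intersect(gns, univ)
```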

Now do the analysis:

 > param <- new("GOHyperGParams", geneIds = gns, universeGeneIds = univ, 
ontology = "BP", annotation = "org.Mm.eg.db")
 > hyp <- hyperGTest(param)
 > head(summary(hyp))
                GOBPID       Pvalue  OddsRatio
GO:0007229 GO:0007229 9.168712e-11 107.987805
GO:0010033 GO:0010033 1.255989e-06  25.192157
GO:0042391 GO:0042391 6.797840e-06   9.590361
GO:0007166 GO:0007166 1.404809e-05   2.941145
GO:0007190 GO:0007190 5.915149e-05  45.738636
GO:0031279 GO:0031279 5.915149e-05  45.738636
              ExpCount Count Size
GO:0007229  1.2413793    11   12
GO:0010033  1.1379310     8   11
GO:0042391  2.0689655    10   20
GO:0007166 15.9310345    32  154
GO:0007190  0.6206897     5    6
GO:0031279  0.6206897     5    6
                                                        Term
GO:0007229              integrin-mediated signaling pathway
GO:0010033                    response to organic substance
GO:0042391                 regulation of membrane potential
GO:0007166 cell surface receptor linked signal transduction
GO:0007190         activation of adenylate cyclase activity
GO:0031279                   regulation of cyclase activity
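
The columns of the summary can be checked by hand with base R, which may help when reading the output. In the sketch below, Size and Count are read off the first row (GO:0007229). The universe size (899) and the number of selected genes (93) are not printed anywhere in this post: they are back-calculated from the ExpCount and OddsRatio columns, so treat them as an assumption. They are presumably the genes that actually carry a BP annotation.

```r
# Values read off the first row of the summary (GO:0007229).
size  <- 12   # genes in the universe annotated to the term
count <- 11   # of those, how many are in the selected list

# ASSUMPTION: not printed in the post; back-calculated from the
# ExpCount and OddsRatio columns.
n_univ <- 899  # universe genes that enter the test
n_sel  <- 93   # selected genes that enter the test

# Expected count if the selected genes were drawn at random.
exp_count <- size * n_sel / n_univ

# Odds ratio from the 2 x 2 table (selected or not) x (annotated or not).
odds_ratio <- (count * (n_univ - n_sel - (size - count))) /
  ((size - count) * (n_sel - count))

# Over-representation p-value: P(X >= count), hypergeometric distribution.
p_value <- phyper(count - 1, size, n_univ - size, n_sel, lower.tail = FALSE)

print(c(ExpCount = exp_count, OddsRatio = odds_ratio, Pvalue = p_value))
```

If the assumed sizes are right, the three numbers should agree with the first row of the table above.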

Best,

Jim



John Reid wrote:
> 
> 
> Robert Gentleman wrote:
>>
>>>>    I am also guessing you have not searched the email list archives 
>>>> for any of the several previous discussions (that is a good place to 
>>>> start).
>>> I did search the email list archives. Nothing came up. Can you 
>>> suggest a good search term?
>>
>>   GOstats seems like a good starting place.  Again, you seem not to 
>> want to say what you did search on, so I have no idea why nothing came 
>> up. The question has been asked quite a few times.
>>
> I did search on GOstats, that certainly didn't help me find an 
> annotation package. All the GOstats documentation says is that I need an 
> annotation package. It does not help the user determine how to find the 
> correct one. I'm not saying it should, just that this information is not 
> easy to find anywhere else either.
>>
>>   Given that you have mouse genes, then I think you might be able to 
>> rule out most of the annotation packages. The BioC views let you 
>> select an organism, which greatly reduces the set you would need to 
>> look at.
>> I get to this place with about 3 clicks from the top of the BioC page.
>>
>> http://www.bioconductor.org/packages/release/Mus_musculus.html
>>
>> And then since you don't have an array it seems unlikely that any of 
>> the array specific packages would be what you want.  I hope with a few 
>> minutes work you would have ended up at org.Mm.eg.db, which you may be 
>> able to adapt to your needs.  You may need some other tool (such as 
>> biomaRt) to map from whatever identifiers you are using to those in 
>> the annotation package (or they might be there already, again you 
>> haven't given us much of anything to work with).
> 
> I don't understand why you keep saying I haven't given you much to work 
> with. The question surely is: Are ENSMUSG identifiers mapped in an 
> annotation package so that I can use them in GOstats? This seemed clear 
> to me in the first list post. Perhaps I have misunderstood some of the 
> issues but at the moment I don't see what. Maybe you could enlighten me?
> 
> I did end up at org.Mm.eg.db myself also in a few clicks but it 
> certainly doesn't use Ensembl identifiers, its description clearly 
> states Entrez genes. So like you say I have extra work to do to map the 
> identifiers.
> 
> Thanks for the help,
> John.
> 

-- 
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623


