[BioC] Mapping NCBI accession numbers to GO terms
Martin Morgan
mtmorgan at fhcrc.org
Fri May 21 14:56:41 CEST 2010
On 05/21/2010 04:45 AM, James F. Reid wrote:
> Hi Steve,
>
> Term(names(get(get("NM_172496", org.Mm.egREFSEQ2EG), org.Mm.egGO)))
>            GO:0001843            GO:0005515
> "neural tube closure"     "protein binding"
I'm partial to
library(org.Mm.eg.db) # organism-specific library
library(GO.db) # GO ontology
## a vector of REFSEQ ids. In the org.*.eg.db packages the 'Lkey' is
## the 'eg' part of the package name, i.e., the ENTREZ gene id, while
## the 'Rkey' is whatever the gene id is mapped to (here, REFSEQ).
## 'mappedRkeys' returns the Rkeys that are actually present in the
## map, so here we take the first three REFSEQ ids to use as an
## example
rids <- mappedRkeys(head(org.Mm.egREFSEQ2EG, 3))
Then the maps
egids <- org.Mm.egREFSEQ2EG[rids] # REFSEQ to ENTREZ id
goids <- org.Mm.egGO[mappedLkeys(egids)] # ENTREZ to GO id
terms <- GOTERM[mappedRkeys(goids)] # GO to TERM
We can then see what we've got, e.g.,
toTable(terms)
or maybe
unique(toTable(terms)[,c("go_id", "Term")])
or more explicitly
r2eg <- toTable(egids)
eg2go <- toTable(goids)
go2term <- unique(toTable(terms)[,c('go_id', 'Term')])
merge(merge(r2eg, eg2go), go2term)
The first few lines of which are
> head(merge(merge(r2eg, eg2go), go2term))
       go_id gene_id    accession Evidence Ontology                Term
1 GO:0001666  235623 NM_001001144      IMP       BP response to hypoxia
2 GO:0003674   19783    NG_005612       ND       MF  molecular_function
3 GO:0003674   22746 NM_001001130       ND       MF  molecular_function
4 GO:0005515  235623 NM_001001144      IPI       MF     protein binding
5 GO:0005575   19783    NG_005612       ND       CC  cellular_component
6 GO:0005575   22746 NM_001001130       ND       CC  cellular_component
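To see what the nested merge() is doing without loading the annotation packages, here is a minimal base-R sketch. The three toy data frames are only the six rows shown above typed in by hand (the real tables returned by toTable() are larger), and the names ending in '_toy' are made up for illustration. merge() joins on the columns the two tables share, so the inner call joins on 'gene_id' and the outer one on 'go_id'.

```r
## toy stand-ins for toTable(egids), toTable(goids) and the unique
## go_id / Term pairs; values copied from the six rows printed above
r2eg_toy <- data.frame(
    gene_id   = c("235623", "19783", "22746"),
    accession = c("NM_001001144", "NG_005612", "NM_001001130"),
    stringsAsFactors = FALSE)

eg2go_toy <- data.frame(
    gene_id  = c("235623", "19783", "22746", "235623", "19783", "22746"),
    go_id    = c("GO:0001666", "GO:0003674", "GO:0003674",
                 "GO:0005515", "GO:0005575", "GO:0005575"),
    Evidence = c("IMP", "ND", "ND", "IPI", "ND", "ND"),
    Ontology = c("BP", "MF", "MF", "MF", "CC", "CC"),
    stringsAsFactors = FALSE)

go2term_toy <- data.frame(
    go_id = c("GO:0001666", "GO:0003674", "GO:0005515", "GO:0005575"),
    Term  = c("response to hypoxia", "molecular_function",
              "protein binding", "cellular_component"),
    stringsAsFactors = FALSE)

## inner merge joins on 'gene_id' (the only shared column);
## outer merge then joins on 'go_id'
toy <- merge(merge(r2eg_toy, eg2go_toy), go2term_toy)
toy
```

Because the join column comes first in the result of merge(), 'go_id' ends up as the leading column, as in the output above.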
An alternative for mapping a single key might be
Term(names(org.Mm.egGO[[ org.Mm.egREFSEQ2EG[["NM_172496"]] ]]))
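The nested [[ ]] is a chained lookup: REFSEQ id to ENTREZ id, ENTREZ id to a list whose names are GO ids, and GO ids to terms. A rough sketch of that chain, using plain named lists as stand-ins for the annotation maps (these are not the real map objects, and the '_toy' names are made up), with the ids and terms quoted earlier in this thread. Each GO entry here holds only a GOID field; the real entries carry more.

```r
## plain-list stand-ins for org.Mm.egREFSEQ2EG, org.Mm.egGO and GOTERM
refseq2eg_toy <- list(NM_172496 = "12808")
eg2go_toy <- list(
    "12808" = list("GO:0001843" = list(GOID = "GO:0001843"),
                   "GO:0005515" = list(GOID = "GO:0005515")))
go2term_toy <- c("GO:0001843" = "neural tube closure",
                 "GO:0005515" = "protein binding")

eg    <- refseq2eg_toy[["NM_172496"]]  # REFSEQ to ENTREZ id
goids <- names(eg2go_toy[[eg]])        # the GO ids are the list names
single_terms <- go2term_toy[goids]     # GO id to term
single_terms
```

The names() step matters because the GO ids are carried as the names of the per-gene list rather than as its values.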
Martin
>
> HTH,
> J.
>
> On 05/21/2010 12:31 PM, Steve Taylor wrote:
>> Hi,
>>
>> I too would like a simple way of getting from Refseq to GOTERM(s).
>>
>> What's the best package (and an example if possible) for getting the
>> actual term information (rather than the GO ID as below) from a Refseq
>> ID?
>>
>> Thanks,
>>
>> Steve
>>
>>>
>>>> Hello,
>>>>
>>>> I'm not sure how to retrieve GO terms associated with the NCBI
>>>> accession numbers (such as "NM_172496").
>>>>
>>>> I have found references to GOLOCUSID, but I cannot find this
>>>> environment. I have GOstats and I can access GOTERM, but not
>>>> GOLOCUSID.
>>>>
>>>>
>>> Perhaps this will get you going:
>>>
>>>> library(org.Mm.eg.db)
>>>> get("NM_172496", org.Mm.egREFSEQ2EG)
>>> [1] "12808"
>>>> names(get("12808", org.Mm.egGO))
>>> [1] "GO:0001843" "GO:0005515"
>>>
>>>> sessionInfo()
>>> R version 2.12.0 Under development (unstable) (2010-05-03 r51901)
>>> x86_64-apple-darwin10.3.0
>>>
>>> locale:
>>> [1] C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices datasets tools utils methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] org.Mm.eg.db_2.4.1 org.Hs.eg.db_2.4.1 RSQLite_0.9-0
>>> [4] DBI_0.2-5 AnnotationDbi_1.11.1 Biobase_2.9.0
>>> [7] weaver_1.15.0 codetools_0.2-2 digest_0.4.2
>>>
>>>
>>>
>>>> Anyways, I also failed to map NCBI accession numbers to Entrez IDs
>>>> using BioIDMapper:
>>>>
>>>
>>> Not bioconductor; please contact the author of that package for concerns
>>> about it.
>>>
>>>
>>>>
>>>> library(BioIDMapper)
>>>> data(glist)
>>>>> head( bio.convert( glist, 1, 24 ) )
>>>> Parsing data from UniProt
>>>> 200 IDs have been processed
>>>> 159 IDs have been processed
>>>> Parsing data from UniProt
>>>> 22 IDs have been processed
>>>> No ID found in database. 0 IDs have been processed
>>>> Done...
>>>> P_GI ACC P_ENTREZGENEID
>>>> 1 "54125119" "A6YK35\r" NA
>>>> 2 "54125311" "A6YK35\r" NA
>>>> 3 "54125051" "A6YK35\r" NA
>>>> 4 "54125369" "A6YK35\r" NA
>>>> 5 "54125435" "A7J4K5\r" NA
>>>> 6 "54125083" "A6YK35\r" NA
>>>>>
>>>>
>>>> Best regards,
>>>>
>>>> confused January
>>>>
>>>> --
>>>> -------- Dr. January Weiner 3 --------------------------------------
>>>> Max Planck Institute for Infection Biology
>>>> Charitéplatz 1
>>>> D-10117 Berlin, Germany
>>>> Web : www.mpiib-berlin.mpg.de
>>>> Tel : +49-30-28460514
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>>
>>
>>
>
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793