[BioC] How to map KEGG gene IDs to gene names?
Martin Morgan
mtmorgan at fhcrc.org
Fri Dec 28 15:23:35 CET 2007
Hi Elliot --
Actually, if you're comfortable at that level, then you might take a
peek 'under the hood' at the R functions in KEGGSOAP -- basically, I
think (I have not explored this) you have access to the complete KEGG
SOAP API from within R, so there is no need for RSPerl.
Martin
> library(KEGGSOAP)
> get.genes.by.pathway
function (pathway.id)
{
    return(unlist(.SOAP(KEGGserver, "get_genes_by_pathway",
        .soapArgs = list(pathway_id = pathway.id),
        action = KEGGaction, xmlns = KEGGxmlns,
        nameSpaces = SOAPNameSpaces(version = KEGGsoapns))))
}
<environment: namespace:KEGGSOAP>
> KEGGSOAP:::KEGGserver
[1] "http://soap.genome.jp/keggapi/request_v6.0.cgi"
Elliot Kleiman <kleiman at rohan.sdsu.edu> writes:
> Hi Martin,
>
> Wow, that is great info! I think I can really use the
> KEGG API to obtain the info I need (e.g. using Perl's
> SOAP::Lite). Also, I just discovered a very interesting
> package offered by the Omega project, called
> `RSPerl`.
>
> "RSPerl provides a bidirectional interface for calling
> R from Perl and Perl from R."
> * http://www.omegahat.org/RSPerl/
>
> Thank you so much for your help!
>
> Elliot Kleiman
>
>
> Martin Morgan wrote:
>> As a quick follow-up to myself, and to indicate how unreliable my info
>> is on this, from
>>
>> http://www.genome.jp/kegg/soap/doc/keggapi_manual.html
>> http://www.genome.jp/kegg/soap/doc/keggapi_manual.html#label:40
>> http://www.genome.jp/dbget/dbget_manual.html
>>
>> 'bget' returns at most 100 entries (hence length(records)==100) and
>> additional options embedded in the character string argument to bget
>> influence the type of data returned. Perhaps there are other things
>> I'm missing, too, and there are better alternatives to the screen
>> scraping I mentioned?
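
A minimal sketch of one way around the 100-entry limit, assuming the limit
applies per call: split the IDs into batches of at most 100 and query each
batch separately. `batch_ids` is a hypothetical helper, not part of KEGGSOAP,
and the IDs below are made-up stand-ins for the real ones.

```r
# Split a vector of KEGG IDs into batches no larger than `size`.
# (Hypothetical helper, not part of KEGGSOAP.)
batch_ids <- function(ids, size = 100L) {
  ids <- unique(ids)
  unname(split(ids, ceiling(seq_along(ids) / size)))
}

# Made-up stand-ins for the 165 gene IDs in the thread.
ids <- paste0("rno:", seq_len(165))
batches <- batch_ids(ids)
lengths(batches)  # number of IDs in each batch

# Each batch could then be passed to bget() as in the thread (not run here):
# details <- lapply(batches, function(b) bget(paste(b, collapse = " ")))
```

Batching keeps every request under the limit, so no records should be
silently dropped; the results of the individual calls still have to be
split and combined afterwards.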
>>
>> Martin
>>
>> Martin Morgan <mtmorgan at fhcrc.org> writes:
>>
>>
>>> Hi Elliot -- not sure that this is the way to go here, but...
>>>
>>>
>>>> details <- bget(paste(csp.genes.rno, collapse=" "))
>>>> nchar(details) # one long character string
>>>>
>>> [1] 515265
>>>
>>>> records <- strsplit(details, "///\\n")[[1]] # ///\n separates records
>>>> length(records)
>>>>
>>> [1] 100
>>>
>>>> length(unique(csp.genes.rno)) # hmm, a few missing...
>>>>
>>> [1] 165
>>>
>>>> cat(records[[1]]) # 1 record, 1 character string; '\n' separates lines
>>>>
>>> ENTRY 113995 CDS R.norvegicus
>>> NAME P2rx5
>>> DEFINITION purinergic receptor P2X, ligand-gated ion channel, 5
>>> ORTHOLOGY KO: K05219 purinergic receptor P2X, ligand-gated ion channel 5
>>> PATHWAY PATH: rno04020 Calcium signaling pathway
>>> PATH: rno04080 Neuroactive ligand-receptor interaction
>>> POSITION 10q24
>>> MOTIF Pfam: P2X_receptor
>>> PROSITE: P2X_RECEPTOR
>>> DBLINKS RGD: 620256
>>> NCBI-GI: 31377508
>>> NCBI-GeneID: 113995
>>> Ensembl: ENSRNOG00000019208
>>> UniProt: P51578
>>> CODON_USAGE T C A G
>>> T 8 20 0 4 9 8 1 0 5 10 0 1 7 5 0 6
>>> C 5 9 2 18 5 4 6 0 2 6 1 17 4 5 2 7
>>> A 8 16 6 6 5 7 5 3 7 16 11 22 4 5 0 7
>>> G 6 7 5 18 8 14 5 2 9 17 7 21 5 9 4 14
>>> AASEQ 455
>>> MGQAAWKGFVLSLFDYKTAKFVVAKSKKVGLLYRVLQLIILLYLLIWVFLIKKSYQDIDT
>>> SLQSAVVTKVKGVAYTNTTMLGERLWDVADFVIPSQGENVFFVVTNLIVTPNQRQGICAE
>>> REGIPDGECSEDDDCHAGESVVAGHGLKTGRCLRVGNSTRGTCEIFAWCPVETKSMPTDP
>>> LLKDAESFTISIKNFIRFPKFNFSKANVLETDNKHFLKTCHFSSTNLYCPIFRLGSIVRW
>>> AGADFQDIALKGGVIGIYIEWDCDLDKAASKCNPHYYFNRLDNKHTHSISSGYNFRFARY
>>> YRDPNGVEFRDLMKAYGIRFDVIVNGKAGKFSIIPTVINIGSGLALMGAGAFFCDLVLIY
>>> LIRKSEFYRDKKFEKVRGQKEDANVEVEANEMEQERPEDEPLERVRQDEQSQELAQSGRK
>>> QNSNCQVLLEPARFGLRENAIVNVKQSQILHPVKT
>>> NTSEQ 1368
>>> atgggccaggcggcctggaaggggtttgtgctgtctctgttcgactataagactgcaaag
>>> ttcgtggtcgccaagagcaagaaggtggggctgctctaccgggtgctgcagctcatcatc
>>> ctgttgtacttgctcatatgggtgtttctgataaagaagagttatcaggacattgacact
>>> tccctgcagagtgctgtggtcaccaaagtcaagggggtggcctatactaacaccacgatg
>>> cttggggaacggctctgggatgtagcagactttgtcattccatctcagggggagaacgtt
>>> ttcttcgtggtcaccaacctgatcgtgactcctaaccagcggcagggcatctgcgctgag
>>> cgtgaaggcatccctgatggcgagtgttctgaggacgatgactgtcacgctggggagtct
>>> gttgtagctgggcacggactgaaaactggccgctgtctccgggtggggaactctacccgg
>>> ggaacctgtgagatctttgcttggtgcccagtggagacaaagtccatgccaacggatccc
>>> cttctaaaggatgccgaaagcttcaccatttccataaagaacttcattcgcttccccaag
>>> ttcaacttctccaaagccaatgtactagaaacagacaacaaacatttcctgaaaacctgt
>>> cacttcagctccacaaatctctactgtcccatcttccgactggggtctattgtccgctgg
>>> gcaggggcagacttccaggacatagccctgaagggtggtgtgataggaatctatattgaa
>>> tgggactgtgaccttgataaagctgcctctaaatgcaacccacactactacttcaaccgc
>>> ctggacaacaaacacacacactccatctcctctgggtacaacttcaggttcgccaggtat
>>> taccgtgaccctaatggggtagagttccgtgacctgatgaaagcctacggcatccgcttt
>>> gatgtgatagttaatggcaaggcaggaaaattcagcatcatccccacagtcatcaacatt
>>> ggttctgggctggcgctcatgggtgctggggctttcttctgcgacctggtacttatctac
>>> ctcatcaggaagagtgagttttaccgagacaagaagtttgagaaagtgaggggtcagaag
>>> gaggatgccaatgttgaggttgaggccaacgagatggagcaggagcggcctgaggacgaa
>>> ccactggagagggttcgtcaggatgagcagtcccaagaactggcccagagtggcaggaag
>>> cagaatagcaactgccaggtgcttttggagcctgccaggtttggcctccgggagaatgcc
>>> attgtgaacgtgaagcagtcacagatcttgcatccagtgaagacgtag
>>>
>>> From here it seems like you're stuck 'screen scraping', e.g.,
>>>
>>>
>>>> ids <- sub("^ENTRY[[:space:]]+([[:digit:]]+).*", "\\1", records)
>>>> ids
>>>>
>>>  [1] "113995" "114098" "114099" "114115" "114207" "114493" "114633" "116601"
>>>  [9] "140447" "140448" "140671" "140693" "170546" "170897" "170926" "171140"
>>> [17] "171378" "24173" "24176" "24180" "24215" "24239" "24242" "24244"
>>> [25] "24245" "24246" "24260" "24316" "24326" "24329" "24337" "24408"
>>> [33] "24409" "24411" "24412" "24414" "24418" "24448" "24598" "24599"
>>> [41] "24600" "24611" "24629" "24654" "24655" "24674" "24675" "24680"
>>> [49] "24681" "24807" "24808" "24816" "24889" "24896" "24925" "24929"
>>> [57] "24938" "25007" "25023" "25031" "25041" "25050" "25107" "25176"
>>> [65] "25187" "25229" "25245" "25262" "25267" "252859" "25302" "25324"
>>> [73] "25342" "25369" "25391" "25400" "25439" "25461" "25477" "25505"
>>> [81] "25570" "25636" "25637" "25645" "25652" "25668" "25679" "25689"
>>> [89] "25706" "25738" "257648" "287745" "288057" "290561" "291926" "29241"
>>> [97] "29316" "29322" "29337" "293508"
>>>
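
The same screen scraping could be pushed a bit further to answer the original
question, by pulling the NAME and DEFINITION fields out of each record
alongside the ID. This is a sketch only: `get_field` is a hypothetical helper,
and the sample record is abridged from the output above.

```r
# Return the value of the first line starting with `field` (e.g. "NAME")
# in one flat-file record, or NA if the field is absent.
# (Hypothetical helper, not part of KEGGSOAP.)
get_field <- function(record, field) {
  lines <- strsplit(record, "\n", fixed = TRUE)[[1]]
  pattern <- paste0("^", field, "[[:space:]]+")
  hit <- grep(pattern, lines, value = TRUE)
  if (length(hit) == 0L) return(NA_character_)
  trimws(sub(pattern, "", hit[1]))
}

# Abridged copy of the first record shown in the thread.
records <- paste(
  "ENTRY       113995            CDS       R.norvegicus",
  "NAME        P2rx5",
  "DEFINITION  purinergic receptor P2X, ligand-gated ion channel, 5",
  sep = "\n")

entry <- vapply(records, get_field, character(1), field = "ENTRY",
                USE.NAMES = FALSE)
gene_info <- data.frame(
  id         = sub("[[:space:]].*$", "", entry),  # first token of ENTRY line
  name       = vapply(records, get_field, character(1), field = "NAME",
                      USE.NAMES = FALSE),
  definition = vapply(records, get_field, character(1), field = "DEFINITION",
                      USE.NAMES = FALSE),
  stringsAsFactors = FALSE)
gene_info
```

Only the first line of each field is read, so multi-line fields such as
PATHWAY or DBLINKS would need their continuation lines handled separately.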
>>> Martin
>>>
>>> Elliot Kleiman <kleiman at rohan.sdsu.edu> writes:
>>>
>>>
>>>> Hi BioC List from {sunny}San Diego, CA!
>>>>
>>>> [Question]:
>>>> * How do you map KEGG gene IDs to textual gene names, gene descriptions
>>>> via BioC?
>>>>
>>>> For example, I am interested in knowing which genes are
>>>> involved in the calcium signaling pathway in rattus norvegicus,
>>>> so I did:
>>>>
>>>> > library(KEGG)
>>>> > # map pathway id to pathway name
>>>> > KEGGPATHID2NAME$"04020"
>>>> [1] "Calcium signaling pathway"
>>>>
>>>> > library(KEGGSOAP)
>>>> > # get all genes in pathway rno04020
>>>> > csp.genes.rno <- get.genes.by.pathway("path:rno04020")
>>>> > # how many genes are involved?
>>>> > length(csp.genes.rno)
>>>> [1] 165
>>>> > # print a few of the results out
>>>> > csp.genes.rno[1:3]
>>>> [1] "rno:113995" "rno:114098" "rno:114099"
>>>>
>>>> The problem is, I don't know what "rno:113995" refers to?
>>>> [not without visiting the KEGG website]
>>>> Instead, I would like to obtain a mapping for each of the retrieved KEGG
>>>> gene IDs into textual gene names, gene descriptions, etc.
>>>>
>>>> How do you do that exactly?
>>>>
>>>> Thank you,
>>>>
>>>> Elliot Kleiman
>>>>
>>>> > # print SessionInfo
>>>> > sessionInfo()
>>>> R version 2.6.1 (2007-11-26)
>>>> i686-pc-linux-gnu
>>>>
>>>> locale:
>>>> LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=C;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats graphics grDevices utils datasets methods base
>>>>
>>>> other attached packages:
>>>> [1] KEGG_2.0.1 KEGGSOAP_1.12.0 SSOAP_0.4-6 RCurl_0.8-3
>>>> [5] XML_1.93-2
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] rcompgen_0.1-17 tools_2.6.1
>>>>
>>>> --
>>>> __________________________
>>>> MS graduate student Program in Computational Science San Diego
>>>> State University
>>>> http://www.csrc.sdsu.edu/
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>> --
>>> Martin Morgan
>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>>
>>> Location: Arnold Building M2 B169
>>> Phone: (206) 667-2793
>>>