[BioC] How to map KEGG gene IDs to gene names?
Elliot Kleiman
kleiman at rohan.sdsu.edu
Fri Dec 28 09:06:21 CET 2007
Hi Martin,
Wow, that is great info! I think I can really use the
KEGG API to obtain the info I need (e.g. using Perl's
SOAP::Lite). Also, I just discovered a very interesting
package offered by the Omegahat project, called
`RSPerl`.
"RSPerl provides a bidirectional interface for calling
R from Perl and Perl from R."
* http://www.omegahat.org/RSPerl/
Thank you so much for your help!
Elliot Kleiman
Martin Morgan wrote:
> As a quick follow-up to myself, and to indicate how unreliable my info
> is on this, from
>
> http://www.genome.jp/kegg/soap/doc/keggapi_manual.html
> http://www.genome.jp/kegg/soap/doc/keggapi_manual.html#label:40
> http://www.genome.jp/dbget/dbget_manual.html
>
> 'bget' returns at most 100 entries (hence length(records)==100) and
> additional options embedded in the character string argument to bget
> influence the type of data returned. Perhaps there are other things
> I'm missing, too, and there are better alternatives to the screen
> scraping I mentioned?
>
> Martin
>
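Since 'bget' is described above as returning at most 100 entries per call, one way to retrieve all 165 records might be to query in batches. The following is only a minimal sketch in base R, assuming the cap applies per call; `chunk_ids` and the placeholder IDs are hypothetical names introduced for illustration, and the bget() call is left commented out because it needs KEGGSOAP and network access.

```r
# Split a vector of IDs into consecutive batches of at most `size` elements.
# `chunk_ids` is a hypothetical helper, not part of KEGGSOAP.
chunk_ids <- function(ids, size = 100L) {
  split(ids, ceiling(seq_along(ids) / size))
}

# Placeholder IDs standing in for the 165 values that
# get.genes.by.pathway("path:rno04020") returned in the thread.
ids <- paste0("rno:", seq_len(165))

batches <- chunk_ids(ids)
print(lengths(batches))  # number of IDs in each batch

# Assumption: each batch is small enough for a single bget() call.
# details <- vapply(batches,
#                   function(b) bget(paste(b, collapse = " ")),
#                   character(1))
# records <- unlist(strsplit(details, "///\\n"))
```

Each element of `batches` holds no more than 100 IDs, so the records from all calls could then be concatenated before any further parsing.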
> Martin Morgan <mtmorgan at fhcrc.org> writes:
>
>
>> Hi Elliot -- not sure that this is the way to go here, but...
>>
>>
>> > details <- bget(paste(csp.genes.rno, collapse=" "))
>> > nchar(details) # one long character string
>> [1] 515265
>> > records <- strsplit(details, "///\\n")[[1]] # ///\n separates records
>> > length(records)
>> [1] 100
>> > length(unique(csp.genes.rno)) # hmm, a few missing...
>> [1] 165
>> > cat(records[[1]]) # 1 record, 1 character string; '\n' separates lines
>> ENTRY       113995            CDS       R.norvegicus
>> NAME        P2rx5
>> DEFINITION  purinergic receptor P2X, ligand-gated ion channel, 5
>> ORTHOLOGY   KO: K05219  purinergic receptor P2X, ligand-gated ion channel 5
>> PATHWAY     PATH: rno04020  Calcium signaling pathway
>>             PATH: rno04080  Neuroactive ligand-receptor interaction
>> POSITION    10q24
>> MOTIF       Pfam: P2X_receptor
>>             PROSITE: P2X_RECEPTOR
>> DBLINKS     RGD: 620256
>>             NCBI-GI: 31377508
>>             NCBI-GeneID: 113995
>>             Ensembl: ENSRNOG00000019208
>>             UniProt: P51578
>> CODON_USAGE       T              C              A              G
>>           T   8 20  0  4     9  8  1  0     5 10  0  1     7  5  0  6
>>           C   5  9  2 18     5  4  6  0     2  6  1 17     4  5  2  7
>>           A   8 16  6  6     5  7  5  3     7 16 11 22     4  5  0  7
>>           G   6  7  5 18     8 14  5  2     9 17  7 21     5  9  4 14
>> AASEQ 455
>> MGQAAWKGFVLSLFDYKTAKFVVAKSKKVGLLYRVLQLIILLYLLIWVFLIKKSYQDIDT
>> SLQSAVVTKVKGVAYTNTTMLGERLWDVADFVIPSQGENVFFVVTNLIVTPNQRQGICAE
>> REGIPDGECSEDDDCHAGESVVAGHGLKTGRCLRVGNSTRGTCEIFAWCPVETKSMPTDP
>> LLKDAESFTISIKNFIRFPKFNFSKANVLETDNKHFLKTCHFSSTNLYCPIFRLGSIVRW
>> AGADFQDIALKGGVIGIYIEWDCDLDKAASKCNPHYYFNRLDNKHTHSISSGYNFRFARY
>> YRDPNGVEFRDLMKAYGIRFDVIVNGKAGKFSIIPTVINIGSGLALMGAGAFFCDLVLIY
>> LIRKSEFYRDKKFEKVRGQKEDANVEVEANEMEQERPEDEPLERVRQDEQSQELAQSGRK
>> QNSNCQVLLEPARFGLRENAIVNVKQSQILHPVKT
>> NTSEQ 1368
>> atgggccaggcggcctggaaggggtttgtgctgtctctgttcgactataagactgcaaag
>> ttcgtggtcgccaagagcaagaaggtggggctgctctaccgggtgctgcagctcatcatc
>> ctgttgtacttgctcatatgggtgtttctgataaagaagagttatcaggacattgacact
>> tccctgcagagtgctgtggtcaccaaagtcaagggggtggcctatactaacaccacgatg
>> cttggggaacggctctgggatgtagcagactttgtcattccatctcagggggagaacgtt
>> ttcttcgtggtcaccaacctgatcgtgactcctaaccagcggcagggcatctgcgctgag
>> cgtgaaggcatccctgatggcgagtgttctgaggacgatgactgtcacgctggggagtct
>> gttgtagctgggcacggactgaaaactggccgctgtctccgggtggggaactctacccgg
>> ggaacctgtgagatctttgcttggtgcccagtggagacaaagtccatgccaacggatccc
>> cttctaaaggatgccgaaagcttcaccatttccataaagaacttcattcgcttccccaag
>> ttcaacttctccaaagccaatgtactagaaacagacaacaaacatttcctgaaaacctgt
>> cacttcagctccacaaatctctactgtcccatcttccgactggggtctattgtccgctgg
>> gcaggggcagacttccaggacatagccctgaagggtggtgtgataggaatctatattgaa
>> tgggactgtgaccttgataaagctgcctctaaatgcaacccacactactacttcaaccgc
>> ctggacaacaaacacacacactccatctcctctgggtacaacttcaggttcgccaggtat
>> taccgtgaccctaatggggtagagttccgtgacctgatgaaagcctacggcatccgcttt
>> gatgtgatagttaatggcaaggcaggaaaattcagcatcatccccacagtcatcaacatt
>> ggttctgggctggcgctcatgggtgctggggctttcttctgcgacctggtacttatctac
>> ctcatcaggaagagtgagttttaccgagacaagaagtttgagaaagtgaggggtcagaag
>> gaggatgccaatgttgaggttgaggccaacgagatggagcaggagcggcctgaggacgaa
>> ccactggagagggttcgtcaggatgagcagtcccaagaactggcccagagtggcaggaag
>> cagaatagcaactgccaggtgcttttggagcctgccaggtttggcctccgggagaatgcc
>> attgtgaacgtgaagcagtcacagatcttgcatccagtgaagacgtag
>>
>> From here it seems like you're stuck 'screen scraping', e.g.,
>>
>>
>> > ids <- sub("^ENTRY[[:space:]]+([[:digit:]]+).*", "\\1", records)
>> > ids
>> [1] "113995" "114098" "114099" "114115" "114207" "114493" "114633" "116601"
>> [9] "140447" "140448" "140671" "140693" "170546" "170897" "170926" "171140"
>> [17] "171378" "24173" "24176" "24180" "24215" "24239" "24242" "24244"
>> [25] "24245" "24246" "24260" "24316" "24326" "24329" "24337" "24408"
>> [33] "24409" "24411" "24412" "24414" "24418" "24448" "24598" "24599"
>> [41] "24600" "24611" "24629" "24654" "24655" "24674" "24675" "24680"
>> [49] "24681" "24807" "24808" "24816" "24889" "24896" "24925" "24929"
>> [57] "24938" "25007" "25023" "25031" "25041" "25050" "25107" "25176"
>> [65] "25187" "25229" "25245" "25262" "25267" "252859" "25302" "25324"
>> [73] "25342" "25369" "25391" "25400" "25439" "25461" "25477" "25505"
>> [81] "25570" "25636" "25637" "25645" "25652" "25668" "25679" "25689"
>> [89] "25706" "25738" "257648" "287745" "288057" "290561" "291926" "29241"
>> [97] "29316" "29322" "29337" "293508"
>>
>> Martin
>>
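The same 'screen scraping' idea could plausibly be extended from the ENTRY line to the NAME and DEFINITION lines, which is what the original question asks for. This is a minimal sketch in base R, assuming each field keyword starts at the beginning of its line and its value fits on that line; `get_field` is a hypothetical helper, and the sample record is abbreviated from the one printed above.

```r
# Abbreviated sample record, copied from the bget() output shown in the thread.
record <- paste(
  "ENTRY       113995            CDS       R.norvegicus",
  "NAME        P2rx5",
  "DEFINITION  purinergic receptor P2X, ligand-gated ion channel, 5",
  "POSITION    10q24",
  sep = "\n"
)

# Hypothetical helper: return the text following `field` on the first line
# that starts with it, or NA when a record has no such line.
get_field <- function(records, field) {
  pattern <- paste0("^", field, "[[:space:]]+(.*)$")
  vapply(strsplit(records, "\n", fixed = TRUE), function(lines) {
    hit <- grep(pattern, lines, value = TRUE)
    if (length(hit) == 0L) NA_character_ else sub(pattern, "\\1", hit[1])
  }, character(1))
}

# One row per record: numeric gene ID, gene name, and description.
gene_table <- data.frame(
  id         = sub("[[:space:]].*$", "", get_field(record, "ENTRY")),
  name       = get_field(record, "NAME"),
  definition = get_field(record, "DEFINITION"),
  stringsAsFactors = FALSE
)
print(gene_table)
```

Because `get_field` is vectorised over its first argument, passing the full `records` vector instead of the single sample `record` would give one row per retrieved gene.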
>> Elliot Kleiman <kleiman at rohan.sdsu.edu> writes:
>>
>>
>>> Hi BioC List from {sunny}San Diego, CA!
>>>
>>> [Question]:
>>> * How do you map KEGG gene IDs to textual gene names, gene descriptions
>>> via BioC?
>>>
>>> For example, I am interested in knowing which genes are
>>> involved in the calcium signaling pathway in Rattus norvegicus,
>>> so I did:
>>>
>>> > library(KEGG)
>>> > # map pathway id to pathway name
>>> > KEGGPATHID2NAME$"04020"
>>> [1] "Calcium signaling pathway"
>>>
>>> > library(KEGGSOAP)
>>> > # get all genes in pathway rno04020
>>> > csp.genes.rno <- get.genes.by.pathway("path:rno04020")
>>> > # how many genes are involved?
>>> > length(csp.genes.rno)
>>> [1] 165
>>> > # print a few of the results out
>>> > csp.genes.rno[1:3]
>>> [1] "rno:113995" "rno:114098" "rno:114099"
>>>
>>> The problem is, I don't know what "rno:113995" refers to?
>>> [not without visiting the KEGG website]
>>> Instead, I would like to obtain a mapping for each of the retrieved KEGG
>>> gene IDs into textual gene names, gene descriptions, etc.
>>>
>>> How do you do that exactly?
>>>
>>> Thank you,
>>>
>>> Elliot Kleiman
>>>
>>> > # print SessionInfo
>>> > sessionInfo()
>>> R version 2.6.1 (2007-11-26)
>>> i686-pc-linux-gnu
>>>
>>> locale:
>>> LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=C;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] KEGG_2.0.1 KEGGSOAP_1.12.0 SSOAP_0.4-6 RCurl_0.8-3
>>> [5] XML_1.93-2
>>>
>>> loaded via a namespace (and not attached):
>>> [1] rcompgen_0.1-17 tools_2.6.1
>>>
>>> --
>>> __________________________
>>> MS graduate student
>>> Program in Computational Science
>>> San Diego State University
>>> http://www.csrc.sdsu.edu/
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>> --
>> Martin Morgan
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M2 B169
>> Phone: (206) 667-2793
>>
>>
>
>