[BioC] How to map KEGG gene IDs to gene names?

Martin Morgan mtmorgan at fhcrc.org
Fri Dec 28 00:46:43 CET 2007


As a quick follow-up to myself, and to indicate how unreliable my info
is on this, from

http://www.genome.jp/kegg/soap/doc/keggapi_manual.html
http://www.genome.jp/kegg/soap/doc/keggapi_manual.html#label:40
http://www.genome.jp/dbget/dbget_manual.html

'bget' returns at most 100 entries per call (hence length(records)==100,
even though 165 ids were requested), and additional options embedded in
the character string argument to bget control the type of data returned.
Perhaps there are other things I'm missing, too. Are there better
alternatives to the screen scraping I mentioned?
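
One way around the 100-entry limit might be to send the ids in batches and
paste the replies together. This is only a minimal sketch: chunk_ids,
fetch_records and fake_bget are names I made up, and fake_bget merely stands
in for bget so the code runs without a KEGG connection. I am assuming that
passing bget itself as the 'fetch' argument behaves the same way, which I
have not checked.

```r
# Split a character vector of ids into batches of at most `size` elements.
chunk_ids <- function(ids, size = 100L) {
    if (length(ids) == 0L) return(list())
    unname(split(ids, ceiling(seq_along(ids) / size)))
}

# Call `fetch` once per batch and return one character string per record.
# `fetch` takes a single space-separated string of ids (as bget does in the
# session quoted below) and returns one long string; "///\n" ends a record.
fetch_records <- function(ids, fetch, size = 100L) {
    batches <- chunk_ids(unique(ids), size)
    replies <- vapply(batches,
                      function(batch) fetch(paste(batch, collapse = " ")),
                      character(1))
    records <- unlist(strsplit(replies, "///\n", fixed = TRUE))
    records[nzchar(trimws(records))]
}

# Hypothetical stand-in for bget: returns one fake record per requested id.
fake_bget <- function(query) {
    ids <- strsplit(query, " ", fixed = TRUE)[[1]]
    paste0("ENTRY       ", sub("^rno:", "", ids), "\n///\n", collapse = "")
}

ids <- paste0("rno:", seq_len(165))
records <- fetch_records(ids, fake_bget)
length(records)
```

With 165 ids this makes two calls, one of 100 ids and one of 65. Against the
real service it would be fetch_records(csp.genes.rno, bget), assuming
KEGGSOAP is loaded.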

Martin

Martin Morgan <mtmorgan at fhcrc.org> writes:

> Hi Elliot -- not sure that this is the way to go here, but...
>
>> details <- bget(paste(csp.genes.rno, collapse=" "))
>> nchar(details) # one long character string
> [1] 515265
>> records <- strsplit(details, "///\\n")[[1]] # ///\n separates records
>> length(records)
> [1] 100
>> length(unique(csp.genes.rno)) # hmm, a few missing...
> [1] 165
>> cat(records[[1]]) # 1 record, 1 character string; '\n' separates lines
> ENTRY       113995            CDS       R.norvegicus
> NAME        P2rx5
> DEFINITION  purinergic receptor P2X, ligand-gated ion channel, 5
> ORTHOLOGY   KO: K05219  purinergic receptor P2X, ligand-gated ion channel 5
> PATHWAY     PATH: rno04020  Calcium signaling pathway
>             PATH: rno04080  Neuroactive ligand-receptor interaction
> POSITION    10q24
> MOTIF       Pfam: P2X_receptor
>             PROSITE: P2X_RECEPTOR
> DBLINKS     RGD: 620256
>             NCBI-GI: 31377508
>             NCBI-GeneID: 113995
>             Ensembl: ENSRNOG00000019208
>             UniProt: P51578
> CODON_USAGE       T               C               A               G
>           T   8  20   0   4   9   8   1   0   5  10   0   1   7   5   0   6
>           C   5   9   2  18   5   4   6   0   2   6   1  17   4   5   2   7
>           A   8  16   6   6   5   7   5   3   7  16  11  22   4   5   0   7
>           G   6   7   5  18   8  14   5   2   9  17   7  21   5   9   4  14
> AASEQ       455
>             MGQAAWKGFVLSLFDYKTAKFVVAKSKKVGLLYRVLQLIILLYLLIWVFLIKKSYQDIDT
>             SLQSAVVTKVKGVAYTNTTMLGERLWDVADFVIPSQGENVFFVVTNLIVTPNQRQGICAE
>             REGIPDGECSEDDDCHAGESVVAGHGLKTGRCLRVGNSTRGTCEIFAWCPVETKSMPTDP
>             LLKDAESFTISIKNFIRFPKFNFSKANVLETDNKHFLKTCHFSSTNLYCPIFRLGSIVRW
>             AGADFQDIALKGGVIGIYIEWDCDLDKAASKCNPHYYFNRLDNKHTHSISSGYNFRFARY
>             YRDPNGVEFRDLMKAYGIRFDVIVNGKAGKFSIIPTVINIGSGLALMGAGAFFCDLVLIY
>             LIRKSEFYRDKKFEKVRGQKEDANVEVEANEMEQERPEDEPLERVRQDEQSQELAQSGRK
>             QNSNCQVLLEPARFGLRENAIVNVKQSQILHPVKT
> NTSEQ       1368
>             atgggccaggcggcctggaaggggtttgtgctgtctctgttcgactataagactgcaaag
>             ttcgtggtcgccaagagcaagaaggtggggctgctctaccgggtgctgcagctcatcatc
>             ctgttgtacttgctcatatgggtgtttctgataaagaagagttatcaggacattgacact
>             tccctgcagagtgctgtggtcaccaaagtcaagggggtggcctatactaacaccacgatg
>             cttggggaacggctctgggatgtagcagactttgtcattccatctcagggggagaacgtt
>             ttcttcgtggtcaccaacctgatcgtgactcctaaccagcggcagggcatctgcgctgag
>             cgtgaaggcatccctgatggcgagtgttctgaggacgatgactgtcacgctggggagtct
>             gttgtagctgggcacggactgaaaactggccgctgtctccgggtggggaactctacccgg
>             ggaacctgtgagatctttgcttggtgcccagtggagacaaagtccatgccaacggatccc
>             cttctaaaggatgccgaaagcttcaccatttccataaagaacttcattcgcttccccaag
>             ttcaacttctccaaagccaatgtactagaaacagacaacaaacatttcctgaaaacctgt
>             cacttcagctccacaaatctctactgtcccatcttccgactggggtctattgtccgctgg
>             gcaggggcagacttccaggacatagccctgaagggtggtgtgataggaatctatattgaa
>             tgggactgtgaccttgataaagctgcctctaaatgcaacccacactactacttcaaccgc
>             ctggacaacaaacacacacactccatctcctctgggtacaacttcaggttcgccaggtat
>             taccgtgaccctaatggggtagagttccgtgacctgatgaaagcctacggcatccgcttt
>             gatgtgatagttaatggcaaggcaggaaaattcagcatcatccccacagtcatcaacatt
>             ggttctgggctggcgctcatgggtgctggggctttcttctgcgacctggtacttatctac
>             ctcatcaggaagagtgagttttaccgagacaagaagtttgagaaagtgaggggtcagaag
>             gaggatgccaatgttgaggttgaggccaacgagatggagcaggagcggcctgaggacgaa
>             ccactggagagggttcgtcaggatgagcagtcccaagaactggcccagagtggcaggaag
>             cagaatagcaactgccaggtgcttttggagcctgccaggtttggcctccgggagaatgcc
>             attgtgaacgtgaagcagtcacagatcttgcatccagtgaagacgtag
>
> From here it seems like you're stuck 'screen scraping', e.g.,
>
>> ids <- sub("^ENTRY[[:space:]]+([[:digit:]]+).*", "\\1", records)
>> ids
>   [1] "113995" "114098" "114099" "114115" "114207" "114493" "114633" "116601"
>   [9] "140447" "140448" "140671" "140693" "170546" "170897" "170926" "171140"
>  [17] "171378" "24173"  "24176"  "24180"  "24215"  "24239"  "24242"  "24244" 
>  [25] "24245"  "24246"  "24260"  "24316"  "24326"  "24329"  "24337"  "24408" 
>  [33] "24409"  "24411"  "24412"  "24414"  "24418"  "24448"  "24598"  "24599" 
>  [41] "24600"  "24611"  "24629"  "24654"  "24655"  "24674"  "24675"  "24680" 
>  [49] "24681"  "24807"  "24808"  "24816"  "24889"  "24896"  "24925"  "24929" 
>  [57] "24938"  "25007"  "25023"  "25031"  "25041"  "25050"  "25107"  "25176" 
>  [65] "25187"  "25229"  "25245"  "25262"  "25267"  "252859" "25302"  "25324" 
>  [73] "25342"  "25369"  "25391"  "25400"  "25439"  "25461"  "25477"  "25505" 
>  [81] "25570"  "25636"  "25637"  "25645"  "25652"  "25668"  "25679"  "25689" 
>  [89] "25706"  "25738"  "257648" "287745" "288057" "290561" "291926" "29241" 
>  [97] "29316"  "29322"  "29337"  "293508"
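
The same screen scraping can probably be extended to the NAME and DEFINITION
lines, which is what the original question asks for. A rough sketch follows;
kegg_field and kegg_table are hypothetical helpers, not part of KEGGSOAP,
and the sample record is abbreviated from the output above. Only the first
line of each field is read, so continuation lines of multi-line fields are
ignored.

```r
# Return the text of the first line of `record` that starts with `field`,
# with the field label and surrounding whitespace removed; NA if absent.
kegg_field <- function(record, field) {
    lines <- strsplit(record, "\n", fixed = TRUE)[[1]]
    hit <- grep(paste0("^", field, "[[:space:]]"), lines, value = TRUE)
    if (length(hit) == 0L) return(NA_character_)
    trimws(sub(paste0("^", field), "", hit[1]))
}

# Build one row per record: KEGG gene id, gene name, and description.
# `organism` is the KEGG organism prefix, e.g. "rno" for rat.
kegg_table <- function(records, organism = "rno") {
    get <- function(field) {
        vapply(records, kegg_field, character(1), field = field,
               USE.NAMES = FALSE)
    }
    entry <- get("ENTRY")
    data.frame(
        id = paste0(organism, ":", sub("[[:space:]].*$", "", entry)),
        name = get("NAME"),
        definition = get("DEFINITION"),
        stringsAsFactors = FALSE
    )
}

# Sample record, abbreviated from the bget output shown above.
record <- paste(
    "ENTRY       113995            CDS       R.norvegicus",
    "NAME        P2rx5",
    "DEFINITION  purinergic receptor P2X, ligand-gated ion channel, 5",
    "POSITION    10q24",
    "",
    sep = "\n")

kegg_table(record)
```

Applied to the full 'records' vector, this would give a data frame that can
be matched against csp.genes.rno by the id column.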
>
> Martin
>
> Elliot Kleiman <kleiman at rohan.sdsu.edu> writes:
>
>> Hi BioC List from {sunny}San Diego, CA!
>>
>> [Question]:
>> * How do you map KEGG gene IDs to textual gene names and gene
>> descriptions via BioC?
>>
>> For example, I am interested in knowing which genes are
>> involved in the calcium signaling pathway in Rattus norvegicus,
>> so I did:
>>
>>  > library(KEGG)
>>  > # map pathway id to pathway name
>>  > KEGGPATHID2NAME$"04020"
>> [1] "Calcium signaling pathway"
>>
>>  > library(KEGGSOAP)
>>  > # get all genes in pathway rno04020
>>  > csp.genes.rno <- get.genes.by.pathway("path:rno04020")
>>  > # how many genes are involved?
>>  > length(csp.genes.rno)
>> [1] 165
>>  > # print a few of the results out
>>  > csp.genes.rno[1:3]
>> [1] "rno:113995" "rno:114098" "rno:114099"
>>
>> The problem is, I don't know what "rno:113995" refers to
>> [not without visiting the KEGG website].
>> Instead, I would like to obtain a mapping for each of the retrieved KEGG
>> gene IDs into textual gene names, gene descriptions, etc.
>>
>> How do you do that exactly?
>>
>> Thank you,
>>
>> Elliot Kleiman
>>
>>  > # print SessionInfo
>>  > sessionInfo()
>> R version 2.6.1 (2007-11-26)
>> i686-pc-linux-gnu
>>
>> locale:
>> LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=C;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] KEGG_2.0.1      KEGGSOAP_1.12.0 SSOAP_0.4-6     RCurl_0.8-3
>> [5] XML_1.93-2
>>
>> loaded via a namespace (and not attached):
>> [1] rcompgen_0.1-17 tools_2.6.1
>>
>> -- 
>> __________________________
>> MS graduate student 
>> Program in Computational Science 
>> San Diego State University
>> http://www.csrc.sdsu.edu/
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> -- 
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M2 B169
> Phone: (206) 667-2793
>

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793


