[BioC] How to map KEGG gene IDs to gene names?
Martin Morgan
mtmorgan at fhcrc.org
Fri Dec 28 00:46:43 CET 2007
As a quick follow-up to myself, and to indicate how unreliable my info
is on this, from
http://www.genome.jp/kegg/soap/doc/keggapi_manual.html
http://www.genome.jp/kegg/soap/doc/keggapi_manual.html#label:40
http://www.genome.jp/dbget/dbget_manual.html
'bget' returns at most 100 entries per call (hence length(records) == 100,
even though 165 genes were requested), and additional options embedded in
the character string argument to bget influence the type of data returned.
Perhaps there are other things I'm missing, too. Are there better
alternatives to the screen scraping I mentioned?
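
One way around the 100-entry cap might be to split the IDs into batches of
at most 100 and make one request per batch. The following is only a minimal
sketch: chunk_ids, fetch_in_batches and fake_fetch are hypothetical names,
and the stand-in fetcher merely mimics the "///\n" record separator seen in
the session below so that the example runs without the KEGG service.

```r
# Split a character vector of IDs into batches of at most `size` elements.
chunk_ids <- function(ids, size = 100L) {
    if (length(ids) == 0L) return(list())
    unname(split(ids, ceiling(seq_along(ids) / size)))
}

# Call `fetch` once per batch and return one character string per record.
# `fetch` is any function that takes a single space-separated ID string and
# returns flat-file text (assumption: this is how bget is used below).
fetch_in_batches <- function(ids, fetch, size = 100L) {
    batches <- chunk_ids(unique(ids), size)
    text <- vapply(batches,
                   function(b) fetch(paste(b, collapse = " ")),
                   character(1))
    unlist(strsplit(text, "///\n", fixed = TRUE), use.names = FALSE)
}

# Stand-in fetcher (NOT the KEGG service): emits one fake record per ID.
fake_fetch <- function(query) {
    ids <- strsplit(query, " ", fixed = TRUE)[[1]]
    paste0("ENTRY       ", sub("^rno:", "", ids), "\n///\n", collapse = "")
}

ids <- paste0("rno:", 1:165)
records <- fetch_in_batches(ids, fake_fetch)
length(records)
```

With KEGGSOAP loaded, the real call would presumably look like
fetch_in_batches(csp.genes.rno, bget); I have not tried that against the
live service.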
Martin
Martin Morgan <mtmorgan at fhcrc.org> writes:
> Hi Elliot -- not sure that this is the way to go here, but...
>
>> details <- bget(paste(csp.genes.rno, collapse=" "))
>> nchar(details) # one long character string
> [1] 515265
>> records <- strsplit(details, "///\\n")[[1]] # ///\n separates records
>> length(records)
> [1] 100
>> length(unique(csp.genes.rno)) # hmm, a few missing...
> [1] 165
>> cat(records[[1]]) # 1 record, 1 character string; '\n' separates lines
> ENTRY       113995 CDS R.norvegicus
> NAME        P2rx5
> DEFINITION  purinergic receptor P2X, ligand-gated ion channel, 5
> ORTHOLOGY   KO: K05219 purinergic receptor P2X, ligand-gated ion channel 5
> PATHWAY     PATH: rno04020 Calcium signaling pathway
>             PATH: rno04080 Neuroactive ligand-receptor interaction
> POSITION    10q24
> MOTIF       Pfam: P2X_receptor
>             PROSITE: P2X_RECEPTOR
> DBLINKS     RGD: 620256
>             NCBI-GI: 31377508
>             NCBI-GeneID: 113995
>             Ensembl: ENSRNOG00000019208
>             UniProt: P51578
> CODON_USAGE        T               C               A               G
>           T   8  20   0   4   9   8   1   0   5  10   0   1   7   5   0   6
>           C   5   9   2  18   5   4   6   0   2   6   1  17   4   5   2   7
>           A   8  16   6   6   5   7   5   3   7  16  11  22   4   5   0   7
>           G   6   7   5  18   8  14   5   2   9  17   7  21   5   9   4  14
> AASEQ       455
>             MGQAAWKGFVLSLFDYKTAKFVVAKSKKVGLLYRVLQLIILLYLLIWVFLIKKSYQDIDT
>             SLQSAVVTKVKGVAYTNTTMLGERLWDVADFVIPSQGENVFFVVTNLIVTPNQRQGICAE
>             REGIPDGECSEDDDCHAGESVVAGHGLKTGRCLRVGNSTRGTCEIFAWCPVETKSMPTDP
>             LLKDAESFTISIKNFIRFPKFNFSKANVLETDNKHFLKTCHFSSTNLYCPIFRLGSIVRW
>             AGADFQDIALKGGVIGIYIEWDCDLDKAASKCNPHYYFNRLDNKHTHSISSGYNFRFARY
>             YRDPNGVEFRDLMKAYGIRFDVIVNGKAGKFSIIPTVINIGSGLALMGAGAFFCDLVLIY
>             LIRKSEFYRDKKFEKVRGQKEDANVEVEANEMEQERPEDEPLERVRQDEQSQELAQSGRK
>             QNSNCQVLLEPARFGLRENAIVNVKQSQILHPVKT
> NTSEQ       1368
>             atgggccaggcggcctggaaggggtttgtgctgtctctgttcgactataagactgcaaag
>             ttcgtggtcgccaagagcaagaaggtggggctgctctaccgggtgctgcagctcatcatc
>             ctgttgtacttgctcatatgggtgtttctgataaagaagagttatcaggacattgacact
>             tccctgcagagtgctgtggtcaccaaagtcaagggggtggcctatactaacaccacgatg
>             cttggggaacggctctgggatgtagcagactttgtcattccatctcagggggagaacgtt
>             ttcttcgtggtcaccaacctgatcgtgactcctaaccagcggcagggcatctgcgctgag
>             cgtgaaggcatccctgatggcgagtgttctgaggacgatgactgtcacgctggggagtct
>             gttgtagctgggcacggactgaaaactggccgctgtctccgggtggggaactctacccgg
>             ggaacctgtgagatctttgcttggtgcccagtggagacaaagtccatgccaacggatccc
>             cttctaaaggatgccgaaagcttcaccatttccataaagaacttcattcgcttccccaag
>             ttcaacttctccaaagccaatgtactagaaacagacaacaaacatttcctgaaaacctgt
>             cacttcagctccacaaatctctactgtcccatcttccgactggggtctattgtccgctgg
>             gcaggggcagacttccaggacatagccctgaagggtggtgtgataggaatctatattgaa
>             tgggactgtgaccttgataaagctgcctctaaatgcaacccacactactacttcaaccgc
>             ctggacaacaaacacacacactccatctcctctgggtacaacttcaggttcgccaggtat
>             taccgtgaccctaatggggtagagttccgtgacctgatgaaagcctacggcatccgcttt
>             gatgtgatagttaatggcaaggcaggaaaattcagcatcatccccacagtcatcaacatt
>             ggttctgggctggcgctcatgggtgctggggctttcttctgcgacctggtacttatctac
>             ctcatcaggaagagtgagttttaccgagacaagaagtttgagaaagtgaggggtcagaag
>             gaggatgccaatgttgaggttgaggccaacgagatggagcaggagcggcctgaggacgaa
>             ccactggagagggttcgtcaggatgagcagtcccaagaactggcccagagtggcaggaag
>             cagaatagcaactgccaggtgcttttggagcctgccaggtttggcctccgggagaatgcc
>             attgtgaacgtgaagcagtcacagatcttgcatccagtgaagacgtag
>
> From here it seems like you're stuck 'screen scraping', e.g.,
>
>> ids <- sub("^ENTRY[[:space:]]+([[:digit:]]+).*", "\\1", records)
>> ids
> [1] "113995" "114098" "114099" "114115" "114207" "114493" "114633" "116601"
> [9] "140447" "140448" "140671" "140693" "170546" "170897" "170926" "171140"
> [17] "171378" "24173" "24176" "24180" "24215" "24239" "24242" "24244"
> [25] "24245" "24246" "24260" "24316" "24326" "24329" "24337" "24408"
> [33] "24409" "24411" "24412" "24414" "24418" "24448" "24598" "24599"
> [41] "24600" "24611" "24629" "24654" "24655" "24674" "24675" "24680"
> [49] "24681" "24807" "24808" "24816" "24889" "24896" "24925" "24929"
> [57] "24938" "25007" "25023" "25031" "25041" "25050" "25107" "25176"
> [65] "25187" "25229" "25245" "25262" "25267" "252859" "25302" "25324"
> [73] "25342" "25369" "25391" "25400" "25439" "25461" "25477" "25505"
> [81] "25570" "25636" "25637" "25645" "25652" "25668" "25679" "25689"
> [89] "25706" "25738" "257648" "287745" "288057" "290561" "291926" "29241"
> [97] "29316" "29322" "29337" "293508"
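
The same kind of screen scraping could be extended to pull out the NAME and
DEFINITION fields, which gives the ID-to-name mapping asked about. This is
only a sketch under assumptions: it assumes each field name starts its line
and is followed by whitespace, as in the record printed above, and get_field
and records_to_table are hypothetical helper names, not KEGGSOAP functions.

```r
# Return the first line of a named field (e.g. "NAME") from one flat-file
# record, or NA if the record has no such field.
get_field <- function(record, field) {
    pattern <- paste0("^", field, "[[:space:]]+")
    lines <- strsplit(record, "\n", fixed = TRUE)[[1]]
    hit <- grep(pattern, lines, value = TRUE)
    if (length(hit) == 0L) return(NA_character_)
    sub(pattern, "", hit[1])
}

# Build a table of id, name and definition from a vector of records.
records_to_table <- function(records) {
    pull <- function(f) {
        unname(vapply(records, get_field, character(1), field = f))
    }
    data.frame(
        id = sub("[[:space:]].*$", "", pull("ENTRY")),  # first token only
        name = pull("NAME"),
        definition = pull("DEFINITION"),
        stringsAsFactors = FALSE
    )
}

# A shortened copy of the record shown above, used as test input.
example <- paste(
    "ENTRY       113995 CDS R.norvegicus",
    "NAME        P2rx5",
    "DEFINITION  purinergic receptor P2X, ligand-gated ion channel, 5",
    "POSITION    10q24",
    sep = "\n")
tab <- records_to_table(example)
tab
```

Applied to the vector from the strsplit call above, records_to_table(records)
would give one row per gene; continuation lines of multi-line fields (e.g.
the second PATH entry) are deliberately ignored here.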
>
> Martin
>
> Elliot Kleiman <kleiman at rohan.sdsu.edu> writes:
>
>> Hi BioC List from {sunny}San Diego, CA!
>>
>> [Question]:
>> * How do you map KEGG gene IDs to textual gene names and gene
>> descriptions via BioC?
>>
>> For example, I am interested in knowing which genes are
>> involved in the calcium signaling pathway in Rattus norvegicus,
>> so I did:
>>
>> > library(KEGG)
>> > # map pathway id to pathway name
>> > KEGGPATHID2NAME$"04020"
>> [1] "Calcium signaling pathway"
>>
>> > library(KEGGSOAP)
>> > # get all genes in pathway rno04020
>> > csp.genes.rno <- get.genes.by.pathway("path:rno04020")
>> > # how many genes are involved?
>> > length(csp.genes.rno)
>> [1] 165
>> > # print a few of the results out
>> > csp.genes.rno[1:3]
>> [1] "rno:113995" "rno:114098" "rno:114099"
>>
>> The problem is, I don't know what "rno:113995" refers to
>> [not without visiting the KEGG website].
>> Instead, I would like to obtain a mapping from each of the retrieved KEGG
>> gene IDs to textual gene names, gene descriptions, etc.
>>
>> How do you do that exactly?
>>
>> Thank you,
>>
>> Elliot Kleiman
>>
>> > # print SessionInfo
>> > sessionInfo()
>> R version 2.6.1 (2007-11-26)
>> i686-pc-linux-gnu
>>
>> locale:
>> LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=C;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] KEGG_2.0.1 KEGGSOAP_1.12.0 SSOAP_0.4-6 RCurl_0.8-3
>> [5] XML_1.93-2
>>
>> loaded via a namespace (and not attached):
>> [1] rcompgen_0.1-17 tools_2.6.1
>>
>> --
>> __________________________
>> MS graduate student
>> Program in Computational Science
>> San Diego State University
>> http://www.csrc.sdsu.edu/
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> --
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M2 B169
> Phone: (206) 667-2793
>