[BioC] help with protein IPI annotation mappings
Sean Davis
sdavis2 at mail.nih.gov
Mon Jan 8 16:47:49 CET 2007
On Monday 08 January 2007 10:22, Steffen Durinck wrote:
> Hi Mark,
>
> I quickly scanned the attributes and filters and it looks like you
> currently cannot use GenBank accession numbers with Ensembl.
> To be sure you could ask the Ensembl helpdesk: helpdesk at ensembl.org if
> genbank accession numbers are in their database and what the name of the
> corresponding filter is. If they don't have genbank ids you could ask
> them if there is a possibility to include genbank ids in future releases.
> Whatever information Ensembl makes available is retrievable through the
> biomaRt package and questions or suggestions related to the data
> present in Ensembl can be best addressed to their helpdesk. Make sure
> you let them know you are using the BioMart version of Ensembl.
>
> Cheers,
> Steffen
>
> Kimpel, Mark William wrote:
> > Steffen,
> >
> > Your code to convert IPI to entrezgene IDs worked like a charm. Now I
> > have run into another problem. I have discovered that some of the IDs I
> > need to map are GenBank IDs of the form (GI:XXXX). I have used
> > listAttributes(ensembl) and cannot figure out which, if any, correspond
> > to the NCBI GI. A previous post in this list indicated that this should
> > be possible, but I must be missing something.
This can be accomplished pretty easily with the E-utilities (eutils) from
NCBI. Say you have a GI number (without the 'GI:' prefix) such as:

47078294 (which corresponds to RefSeq NM_000022, just as an example)

You can use eLink to get the corresponding record in the Entrez Gene
database by doing:

readLines(url('http://www.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=nucleotide&db=gene&id=47078294'))

This returns XML, and the <Id>100</Id> tag in it holds the Gene ID linked
to that GI number. I use only readLines here, but you could also parse the
output with the XML package if you like. If you loop over your GI numbers,
you can retrieve them all. Be sure to leave a little time between queries
so that you don't set off any alarms at NCBI about too many queries in too
little time.
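
The parsing and looping described above could be sketched roughly as
follows. This is only an illustration, not tested against the live service:
the function names (extract_gene_ids, gi_to_gene, map_gis) and the
one-second pause are made up for this example, and the XPath assumes that
the linked Gene IDs sit under LinkSetDb/Link/Id in the eLink output, with
the queried GI echoed separately elsewhere. The URL is the one used in the
readLines call above; it may need adjusting if NCBI has moved the service.

```r
library(XML)  # assumes the XML package is installed

# Pull the linked Gene IDs out of eLink XML supplied as a single string.
# Assumption: linked IDs live under LinkSetDb/Link/Id.
extract_gene_ids <- function(xml_text) {
  doc <- xmlParse(xml_text, asText = TRUE)
  ids <- xpathSApply(doc, "//LinkSetDb/Link/Id", xmlValue)
  as.character(unlist(ids))  # character(0) when nothing is linked
}

# Query eLink for one GI number (given without the 'GI:' prefix).
gi_to_gene <- function(gi) {
  u <- paste0("http://www.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi",
              "?dbfrom=nucleotide&db=gene&id=", gi)
  con <- url(u)
  on.exit(close(con))
  extract_gene_ids(paste(readLines(con, warn = FALSE), collapse = "\n"))
}

# Loop over several GI numbers, sleeping between requests.
# The pause length is a guess, not a value documented by NCBI.
map_gis <- function(gis, pause = 1) {
  out <- vector("list", length(gis))
  names(out) <- gis
  for (i in seq_along(gis)) {
    out[[i]] <- gi_to_gene(gis[i])
    Sys.sleep(pause)
  }
  out
}
```

Calling map_gis(c("47078294", ...)) with your own GI numbers would then give
a list with one entry per GI, each holding whatever Gene IDs eLink reported
for it (possibly none).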
Hope that helps.
Sean