[BioC] help with protein IPI annotation mappings

Tue Jan 9 03:11:17 CET 2007

Sean and Paul,

Thanks for your help, it will work.

Mark

Mark W. Kimpel MD 

(317) 490-5129 Work, & Mobile

(317) 663-0513 Home (no voice mail please)

1-(317)-536-2730 FAX

-----Original Message-----
From: bioconductor-bounces at stat.math.ethz.ch
[mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Paul Leo
Sent: Monday, January 08, 2007 6:54 PM
To: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] help with protein IPI annotation mappings

Sorry I've come in a bit late on this topic.
Elink is a nice choice; you can also get the tab-delimited flat file of
the IPI cross-reference database at:

ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/

1# Database from which master entry of this IPI entry has been taken.
One of either SP (UniProtKB/Swiss-Prot), TR (UniProtKB/TrEMBL), ENSEMBL
(Ensembl), ENSEMBL_HAVANA (Ensembl Havana subset), REFSEQ_STATUS (where
STATUS corresponds to the RefSeq entry revision status), VEGA (Vega),
TAIR (TAIR Protein data set) or HINV (H-Invitational Database).
2# UniProtKB accession number or Vega ID or Ensembl ID or RefSeq ID or
TAIR Protein ID or H-InvDB ID.
3# International Protein Index identifier.
4# Supplementary UniProtKB/Swiss-Prot entries associated with this IPI
entry.
5# Supplementary UniProtKB/TrEMBL entries associated with this IPI
entry.
6# Supplementary Ensembl entries associated with this IPI entry. Havana
curated transcripts are preceded by the key HAVANA: (e.g.
HAVANA:ENSP00000237305;ENSP00000356824;).
7# Supplementary list of RefSeq STATUS:ID couples (separated by a
semi-colon ';') associated with this IPI entry (RefSeq entry revision
status details).
8# Supplementary TAIR Protein entries associated with this IPI entry.
9# Supplementary H-Inv Protein entries associated with this IPI entry.
10# Protein identifiers (cross reference to EMBL/Genbank/DDBJ nucleotide
databases).
11# List of HGNC number, HGNC official gene symbol couples (separated by
a semi-colon ';') associated with this IPI entry.
12# List of NCBI Entrez Gene gene number, Entrez Gene Default Gene
Symbol couples (separated by a semi-colon ';') associated with this IPI
entry.
13# UNIPARC identifier associated with the sequence of this IPI entry.
14# UniGene identifiers associated with this IPI entry.
15# CCDS identifiers associated with this IPI entry.
16# RefSeq GI protein identifiers associated with this IPI entry.
17# Supplementary Vega entries associated with this IPI entry.

... see http://www.ebi.ac.uk/IPI/xrefs.html

Columns 3 and 7 would probably suit you and would be easy to read into
R. Actually, you should probably take columns 3 and 7 only from rows
where column 1 is REFSEQ_*. (Note that you can also get the MySQL dump
of this database, which is even better if you know some SQL.) There
might be only a few missing entries (no REFSEQ) that you can get with
elink, as Sean suggests.
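
For anyone following along, here is a minimal sketch of that filtering step in R. It assumes the xrefs file has already been downloaded and unpacked, is tab-delimited with the 17 columns listed above, and has no header row (pass header = TRUE if your copy has one). The function name read_ipi_refseq and the output column names are made up for illustration.

```r
# Sketch only: read an IPI xrefs flat file and keep the rows whose master
# entry (column 1) comes from RefSeq. The function name and the output
# column names are hypothetical, not part of any package.
read_ipi_refseq <- function(path, header = FALSE) {
  xr <- read.delim(path, header = header, quote = "",
                   stringsAsFactors = FALSE, colClasses = "character")
  keep <- grepl("^REFSEQ_", xr[[1]])      # column 1: source database
  data.frame(ipi    = xr[[3]][keep],      # column 3: IPI identifier
             master = xr[[2]][keep],      # column 2: master entry ID
             refseq = xr[[7]][keep],      # column 7: STATUS:ID couples
             stringsAsFactors = FALSE)
}
```

Column 7 holds semicolon-separated STATUS:ID couples, so strsplit(refseq, ";") would be one way to break them apart afterwards.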

Cheers
Paul Leo

-----Original Message-----
From: bioconductor-bounces at stat.math.ethz.ch
[mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Sean Davis
Sent: Tuesday, 9 January 2007 1:48 AM
To: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] help with protein IPI annotation mappings

On Monday 08 January 2007 10:22, Steffen Durinck wrote:
> Hi Mark,
>
> I quickly scanned the attributes and filters and it looks like you
> currently can not use genbank accession numbers with Ensembl.
> To be sure you could ask the Ensembl helpdesk: helpdesk at ensembl.org
> if genbank accession numbers are in their database and what the name
> of the corresponding filter is.  If they don't have genbank ids you
> could ask them if there is a possibility to include genbank ids in
> future releases.
> Whatever information Ensembl makes available is retrievable through
> the biomaRt package and questions or suggestions related to the data
> present in Ensembl can be best addressed to their helpdesk.  Make
> sure you let them know you are using the BioMart version of Ensembl.
>
> Cheers,
> Steffen
>
> Kimpel, Mark William wrote:
> > Steffen,
> >
> > Your code to convert IPI to entrezgene ID's worked like a charm. Now
> > I have run into another problem. I have discovered that some of the
> > ID's I need to map are GenBank ID's of the form (GI:XXXX). I have
> > used listAttributes(ensembl) and cannot figure out which, if any,
> > correspond to the NCBI GI. A previous post in this list indicated
> > that this should be possible, but I must be missing something.

This can be accomplished with eutils from NCBI pretty easily.  If you
have a GI number (without the 'GI:') like:

47078294 (which corresponds to refseq NM_000022, just for example)

You can use eLink to get the reference to the Entrez Gene database, if
you like, by doing:

readLines(url('http://www.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=nucleotide&db=gene&id=47078294'))

This will return XML, and the <Id>100</Id> tag is the Gene ID of that
GI number.  I show here just the readLines output, but you could also
use the XML package to do the parsing of the output if you liked.  If
you loop over your GI numbers, you can retrieve them all.  Be sure to
leave a little time between queries so that you don't set off any
alarms at NCBI about too many queries in too little time.
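
A rough sketch of that loop, for illustration only. It uses base R and a regular expression rather than the XML package, and it assumes the eLink response wraps each linked ID as <Link><Id>...</Id></Link>, which is how the query GI echoed back in <IdList> gets skipped. The function names, the one-second pause, and the base URL default (copied from the call above; the current NCBI host may differ) are all assumptions.

```r
# Sketch only: function names and the default pause are hypothetical.

# Pull the linked IDs out of eLink XML as returned by readLines().
# Only <Id> tags inside <Link>...</Link> are kept, so the query GI that
# is echoed back inside <IdList> is ignored.
extract_linked_ids <- function(xml_lines) {
  xml <- paste(xml_lines, collapse = "")
  links <- regmatches(xml, gregexpr("<Link>.*?</Link>", xml, perl = TRUE))[[1]]
  sub(".*<Id>\\s*([0-9]+)\\s*</Id>.*", "\\1", links, perl = TRUE)
}

# Query eLink once per GI number, pausing between requests.
# Pass GI numbers as character strings to avoid numeric formatting issues.
gi_to_gene <- function(gi, pause = 1,
                       base = "http://www.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi") {
  result <- vector("list", length(gi))
  names(result) <- as.character(gi)
  for (i in seq_along(gi)) {
    query <- paste0(base, "?dbfrom=nucleotide&db=gene&id=", gi[i])
    result[[i]] <- extract_linked_ids(readLines(query, warn = FALSE))
    Sys.sleep(pause)  # leave a little time between queries
  }
  result
}
```

Calling gi_to_gene("47078294") would need network access; the result is a list with one character vector of Gene IDs per GI, empty where eLink returned no link.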

Hope that helps.

Sean

_______________________________________________
Bioconductor mailing list
Bioconductor at stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
