[BioC] IPI to entrez id

Dick Beyer dbeyer at u.washington.edu
Tue Mar 8 20:39:50 CET 2011


Hi to all,

For several years now, I have been doing GO analysis on lists of proteins derived from MS.  I am given IPIs by the proteomics folks and need the corresponding Entrez Gene IDs.  Putting aside the issues of non-unique mapping from IPI to EG, isoforms, etc., I was wondering if anyone would comment on my method of getting the Entrez Gene IDs. I'd really like to use Marc Carlson's merge method (shown below), but that approach seems to miss several thousand IPI/EG matches that my method finds.

I start with ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.dat.gz, and extract a subset of the rows:

ipiHUMAN <- readLines(con = "ipi.HUMAN.3.80.dat",n=-1) # build 11feb2011
dbfetch.all <- ipiHUMAN
rm(ipiHUMAN)

# Explanation of the data format is found here
# http://www.ebi.ac.uk/2can/tutorials/formats.html#swiss

length(dbfetch.all) #  3180244
length(eg  <- grep("^DR   Entrez Gene",    dbfetch.all)) # 80296
length(ids <- grep("^ID",                  dbfetch.all)) # 86719
length(de  <- grep("^DE",                  dbfetch.all)) # 92454
length(ac  <- grep("^AC",                  dbfetch.all)) # 93720
length(ug  <- grep("^DR   UniGene",        dbfetch.all)) # 88314
length(up  <- grep("^DR   UniProtKB",      dbfetch.all)) # 110593
length(en  <- grep("^DR   ENSEMBL",        dbfetch.all)) # 77340
length(rs  <- grep("^DR   REFSEQ_REVIEWED",dbfetch.all)) # 14559

and eventually turn this into a data.frame with the columns:

"IPI","EG","GeneSymbol","UniGene","UniProtKB","ENSEMBL","REFSEQ_REVIEWED"

(Note: Not every IPI entry has every field)

For this build of the IPI file, my data.frame ends up as
dim(dat.all)
[1] 183153      7
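
One way the step from the raw lines to a data.frame could be sketched, assuming each record starts with an ID line, ends with "//", and that DR lines have the form "DR   <database>; <value>; ...". The toy records, the helper name first_dr, and the identifiers in it are made up for illustration and are not taken from the real IPI file. The sketch keeps only the first value per record, so unlike dat.all it never produces more than one row per IPI:

```r
# Toy lines standing in for dbfetch.all (layout and IDs are assumptions)
toy <- c(
  "ID   IPI00000001.1         IPI;      PRT;   100 AA.",
  "DR   Entrez Gene; 1111; GENEA; -.",
  "DR   UniGene; Hs.1; -.",
  "//",
  "ID   IPI00000002.1         IPI;      PRT;   200 AA.",
  "DR   UniGene; Hs.2; -.",
  "//"
)

# Record number for every line: increases by one at each ID line
rec <- cumsum(grepl("^ID", toy))

# Hypothetical helper: first value of the given DR database per record,
# NA when the record has no such line
first_dr <- function(lines, rec, db) {
  hit  <- grep(paste0("^DR   ", db, ";"), lines)
  val  <- sub("^DR   [^;]+; *([^;]+);.*$", "\\1", lines[hit])
  out  <- rep(NA_character_, max(rec))
  keep <- !duplicated(rec[hit])
  out[rec[hit][keep]] <- val[keep]
  out
}

toy.df <- data.frame(
  IPI     = sub("^ID   (IPI[0-9]+).*$", "\\1", toy[grepl("^ID", toy)]),
  EG      = first_dr(toy, rec, "Entrez Gene"),
  UniGene = first_dr(toy, rec, "UniGene"),
  stringsAsFactors = FALSE
)
toy.df
```

The second toy record has no Entrez Gene line, so its EG is NA, which is the situation the note about missing fields describes.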

Of these 183153 rows, there are 171909 unique IPIs and 22342 unique Entrez Gene IDs.

The merge method shown below from Marc Carlson gives 69315 unique IPIs and 17783 unique Entrez Gene IDs (you get the same numbers whether you use org.Hs.egGO2ALLEGS or org.Hs.egGO).
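
A minimal sketch of how the two sets of counts could be compared, using toy data only. The data frames mine and merged are hypothetical stand-ins for dat.all and for the merge result, and the column names ipi_id and gene_id are an assumption about what the merge result contains:

```r
# Toy stand-ins; none of these IDs are real
mine <- data.frame(IPI = c("IPI1", "IPI2", "IPI3", "IPI3"),
                   EG  = c("10", "20", "30", "31"),
                   stringsAsFactors = FALSE)
merged <- data.frame(ipi_id  = c("IPI1", "IPI3"),
                     gene_id = c("10", "30"),
                     stringsAsFactors = FALSE)

# Unique counts, as quoted in the text for the real data
n.ipi <- length(unique(mine$IPI))
n.eg  <- length(unique(mine$EG[!is.na(mine$EG)]))

# IDs present in my table but absent from the merge result
missing.ipi <- setdiff(mine$IPI, merged$ipi_id)
missing.eg  <- setdiff(mine$EG[!is.na(mine$EG)], merged$gene_id)
```

Looking at a handful of the IDs in missing.ipi by hand is probably the quickest way to see why the merge does not return them.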

When I build my 7 column data.frame, I initially get 22305 unique Entrez Gene IDs, and I then go through some additional steps of trying to fill in the missing EGs.  I do this by taking the IPIs with no EGs, and using biomaRt with UniGene, UniProtKB etc as inputs to getBM(), and hope I get a few more EGs.

For example:

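library(biomaRt)
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# Column 2 of dat.all is EG and column 4 is UniGene
sum(is.na(dat.all[, 4]))                              # rows with no UniGene ID
sum(z <- is.na(dat.all[, 2]) & !is.na(dat.all[, 4]))  # rows with a UniGene ID but no EG
w <- getBM(attributes = c("entrezgene", "unigene", "hgnc_symbol", "description"),
           filters = "unigene", values = unique(dat.all[z, 4]), mart = mart)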

By doing several of these getBM() steps, I add 37 more EGs!
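
A sketch of how a getBM() result might be folded back into the table, shown on toy data so that it runs without a connection to the mart. dat.toy and w.toy are hypothetical stand-ins for dat.all and w; the columns entrezgene and unigene are the attributes requested above. match() takes the first hit only, so a UniGene ID that maps to several Entrez Gene IDs would need extra handling:

```r
# Toy stand-ins for dat.all and for the getBM() result w
dat.toy <- data.frame(IPI     = c("IPI_A", "IPI_B", "IPI_C"),
                      EG      = c("101", NA, NA),
                      UniGene = c("Hs.1", "Hs.2", NA),
                      stringsAsFactors = FALSE)
w.toy <- data.frame(entrezgene = "202",
                    unigene    = "Hs.2",
                    stringsAsFactors = FALSE)

# Rows that lack an EG but have a UniGene ID to look up
fill <- is.na(dat.toy$EG) & !is.na(dat.toy$UniGene)

# Fill in the EG where the lookup found one; unmatched rows stay NA
dat.toy$EG[fill] <- w.toy$entrezgene[match(dat.toy$UniGene[fill], w.toy$unigene)]
dat.toy
```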

My method is long and painful.  That merge approach is clean and beautiful.

Is there a way to add to the merge argument or something that would give me the additional 100K+ IPIs and 4500+ EGs?

------------------------------
Message: 20
Date: Fri, 18 Feb 2011 13:17:18 -0800
From: "Carlson, Marc R" <mcarlson at fhcrc.org>
To: <bioconductor at stat.math.ethz.ch>
Subject: Re: [BioC] IPI to entrez id

Hi Viritha,

These things can never be 1:1, but you can pretty easily just cram them all into a huge data.frame by doing this:

library(org.Hs.eg.db)
allAnnots <- merge(toTable(org.Hs.egPROSITE), toTable(org.Hs.egGO), by.x="gene_id", by.y="gene_id")
head(allAnnots)

Once you have done this, you may notice not only that these things are almost never (if ever) 1:1, but also that this could have been even worse if I had used the GO2ALL mappings (and I probably should have, but I can't really tell because I have almost no information about what you want to do).  Anyhow, I hope this helps you, but if you have a more specific use for this information that you are willing to talk about, then we might be able to give you a better answer.
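
The many-to-many behaviour of merge() can be seen on toy data. By default, base R merge() keeps only keys that appear in both tables, and its all.x = TRUE argument keeps unmatched rows from the first table. The tables and IDs below are made up, and whether all.x = TRUE changes the counts for the real org.Hs.eg.db tables has not been checked here:

```r
# Toy tables; gene "1" has two IPIs and two GO terms, gene "2" has no GO term
prosite.toy <- data.frame(gene_id = c("1", "1", "2"),
                          ipi_id  = c("IPI_A", "IPI_B", "IPI_C"),
                          stringsAsFactors = FALSE)
go.toy <- data.frame(gene_id = c("1", "1"),
                     go_id   = c("GO_x", "GO_y"),
                     stringsAsFactors = FALSE)

# Default merge: 2 IPIs x 2 GO terms for gene "1"; gene "2" is dropped
inner <- merge(prosite.toy, go.toy, by = "gene_id")

# all.x = TRUE keeps IPI_C, with NA in the go_id column
outer <- merge(prosite.toy, go.toy, by = "gene_id", all.x = TRUE)

nrow(inner)
nrow(outer)
```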


   Marc
------------------------------

Thanks very much,
Dick
*******************************************************************************
Richard P. Beyer, Ph.D.	University of Washington
Tel.:(206) 616 7378	Env. & Occ. Health Sci. , Box 354695
Fax: (206) 685 4696	4225 Roosevelt Way NE, # 100
 			Seattle, WA 98105-6099
http://depts.washington.edu/ceeh/members_fc_bioinfo.html
http://staff.washington.edu/~dbeyer


