[BioC] IPI to entrez id

Thu Mar 10 19:04:03 CET 2011

Hi Dick,

So let's look at the example from your last post, where you merged the
GO table with the PROSITE one.  It is important to understand that when
you call merge(), it does its magic by basically performing an inner
join on the two tables passed as its first two arguments.  The mechanics
of that merge therefore mean that you have effectively restricted the
results to only those entrez genes that have BOTH a GO annotation AND a
PROSITE annotation.
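
To make the inner-join point concrete, here is a minimal sketch using two
toy tables in place of toTable(org.Hs.egPROSITE) and toTable(org.Hs.egGO).
The gene and annotation IDs are invented placeholders, not real
org.Hs.eg.db content; only the behaviour of merge() is the point.

```r
# Toy stand-ins for the two annotation tables (all IDs are made up).
prosite <- data.frame(gene_id    = c("1", "2", "3"),
                      prosite_id = c("PS_a", "PS_b", "PS_c"),
                      stringsAsFactors = FALSE)
go      <- data.frame(gene_id = c("2", "3", "4"),
                      go_id   = c("GO_x", "GO_y", "GO_z"),
                      stringsAsFactors = FALSE)

# Default merge() has all = FALSE, i.e. an inner join on gene_id:
# only genes present in BOTH tables survive.
inner <- merge(prosite, go, by = "gene_id")

# all = TRUE keeps genes found in EITHER table (full outer join),
# filling the missing annotation with NA.
outer <- merge(prosite, go, by = "gene_id", all = TRUE)

sort(unique(inner$gene_id))  # genes with both annotations
sort(unique(outer$gene_id))  # genes with at least one annotation
```

If the goal is to keep every gene from one table regardless of the other,
all.x = TRUE or all.y = TRUE gives the corresponding left or right join.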

So the result you see (more EGs from the bigger join) would happen if,
for example, you had increased the size of the IPI tables (by pairing up
some deprecated IPI IDs with some legitimate entrez gene IDs).  In this
situation, these entrez gene IDs would be perfectly legitimate, but
their IPI IDs would all be older, deprecated IPI IDs.  I am not sure if
that is what you really want or not, but if it is, then the final table
would indeed be bigger.
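
As a rough illustration of that situation, the sketch below stacks a toy
"primary" IPI-to-EG table on top of a toy table of deprecated (secondary)
IPI IDs that point at the same entrez genes.  All IDs and both table names
are hypothetical; this is not how org.Hs.eg.db is built, just a picture of
why the table grows.

```r
# Hypothetical current (primary) IPI IDs and their entrez gene IDs.
primary <- data.frame(IPI = c("IPI_p1", "IPI_p2"),
                      EG  = c("10", "20"),
                      stringsAsFactors = FALSE)

# Hypothetical deprecated (secondary) IPI IDs paired with the same genes.
secondary <- data.frame(IPI = c("IPI_old1", "IPI_old2", "IPI_old3"),
                        EG  = c("10", "20", "20"),
                        stringsAsFactors = FALSE)

# Stack the two mappings and drop exact duplicate rows.
combined <- unique(rbind(primary, secondary))

length(unique(primary$IPI))   # IPIs when only primary IDs are used
length(unique(combined$IPI))  # IPIs once deprecated IDs are added
length(unique(combined$EG))   # entrez genes covered by the combined table
```

In this toy case the number of IPI IDs grows while the set of entrez genes
stays the same, because every deprecated ID maps to a gene that is already
present.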

Hope that clarifies things,

  Marc

On 03/10/2011 07:16 AM, Dick Beyer wrote:
> Hi Marc,
>
> Thanks very much for the merge example.  That's so much cleaner than my usual approach.
>
> As far as the differing numbers of IPIs, I agree that using the AC field gives a lot more IPIs than just using the ID field.  I guess I'm trying to solve a different problem than other folks.  I get these lists of IPIs from proteomics folks, and some of the IPIs wouldn't show up in the ID field of "ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.dat.gz", but are present in the AC field because they are no longer the current primary identifier (http://www.ebi.ac.uk/IPI/Algorithm.html#SECONDARIES).  What I am really after is the Entrez Gene ID, so for me, the IPI is just a pointer to the EG.  
>
> However, with this approach:
>
> library(org.Hs.eg.db)
> allAnnots <- merge(toTable(org.Hs.egPROSITE), toTable(org.Hs.egGO), by.x="gene_id", by.y="gene_id")
>
> I get several thousand fewer EGs than using either ipi.HUMAN.dat or ipi.HUMAN.xrefs.  I looked at a few of the EGs that I get out of these that are not in org.Hs.eg.db, and they seem valid.  I'll do some more checking to be sure.  In ipi.HUMAN.xrefs I get 22113 unique EGs and 86719 unique IPIs from a download 2 days ago from "ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.xrefs.gz"
>
> What I had hoped I was creating was a table with any and all IPIs (primary and secondary) but each with valid EGs.
>
> Thanks very much for your great help,
> Dick
> ------------------------------
>
> Message: 24
> Date: Wed, 09 Mar 2011 17:25:34 -0800
> From: Marc Carlson <mcarlson at fhcrc.org>
> To: bioconductor at r-project.org
> Subject: Re: [BioC] IPI to entrez id
> Message-ID: <4D78288E.6060904 at fhcrc.org>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Hi Dick,
>
> Is there any reason why something like this won't work for you to attach
> the GO Ids?
>
> merge(dat.all, toTable(org.Hs.egGO), by.x="EG", by.y="gene_id")
>
>
>
> When I build the org.Hs.eg.db package, I download the IPI Ids in the
> mySQL database from the following source:
>
> ftp://ftp.ebi.ac.uk/pub/databases/IPI/current
>
>
> The most recent time I did this (last week) results in an ipi to gene
> table that contains only 86719 unique IPIs.  And that number drops a bit
> when I fold the IPI IDs into the org database (which takes as its set of
> unique entrez gene IDs the ones that are currently listed at NCBI) to
> 77387 distinct IPI IDs.  And if you merge with a GO table, I expect that
> it will drop a bit more.
>
> In looking at the file that you are parsing, I get exactly the same
> number of unique IPI ids if I extract from the ID fields and match to
> the Entrez Gene fields (which also gives me the exact same number of
> entrez gene IDs).  I can only get the huge additional number of IPI Ids
> from this file if I also mine the AC field and assume that these IPI Ids
> also should map to the exact same things.  But the direct database dump
> from EBI does not give me these mappings.  In fact, it does not seem to
> even contain them.  This makes me concerned that these IDs may not be
> what you think they are.
>
>
> Anyhow, I hope this helps,
>
>
>   Marc
>
>
>
>
> On 03/08/2011 11:39 AM, Dick Beyer wrote:
> Hi to all,
>
> For several years now, I have been doing GO analysis on lists of
> proteins derived from MS.  I am given IPIs by the proteomics folks and
> need the corresponding Entrez Gene IDs.  Putting aside the issues of
> non-unique mapping from IPI to EG, isoforms, etc., I was wondering if
> anyone would comment on my method of getting the Entrez Gene IDs. I'd
> really like to use Marc Carlson's merge method (shown below), but that
> approach seems to miss several thousand IPI/EG matches that my method
> finds.
>
> I start with
> ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.dat.gz, and
> extract a subset of the rows:
>
> ipiHUMAN <- readLines(con = "ipi.HUMAN.3.80.dat",n=-1) # build 11feb2011
> dbfetch.all <- ipiHUMAN
> rm(ipiHUMAN)
>
> # Explanation of the data format is found here
> # http://www.ebi.ac.uk/2can/tutorials/formats.html#swiss
>
> length(dbfetch.all) #  3180244
> length(eg  <- grep("^DR   Entrez Gene",    dbfetch.all)) # 80296
> length(ids <- grep("^ID",                  dbfetch.all)) # 86719
> length(de  <- grep("^DE",                  dbfetch.all)) # 92454
> length(ac  <- grep("^AC",                  dbfetch.all)) # 93720
> length(ug  <- grep("^DR   UniGene",        dbfetch.all)) # 88314
> length(up  <- grep("^DR   UniProtKB",      dbfetch.all)) # 110593
> length(en  <- grep("^DR   ENSEMBL",        dbfetch.all)) # 77340
> length(rs  <- grep("^DR   REFSEQ_REVIEWED",dbfetch.all)) # 14559
>
> and eventually turn this into a data.frame with the columns:
>
> "IPI","EG","GeneSymbol","UniGene","UniProtKB","ENSEMBL","REFSEQ_REVIEWED"
>
> (Note: Not every IPI entry has every field)
>
> For this build of the IPI file, my data.frame ends up as
> dim(dat.all)
> [1] 183153      7
>
> Of these 183153 IPIs, there are 171909 unique IPIs, and 22342 unique
> Entrez Gene IDs.
>
> The merge method shown below from Marc Carlson gives 69315 unique IPIs
> and 17783 unique Entrez Gene IDs (you get the same numbers whether you
> use org.Hs.egGO2ALLEGS or org.Hs.egGO).
>
> When I build my 7 column data.frame, I initially get 22305 unique
> Entrez Gene IDs, and I then go through some additional steps of trying
> to fill in the missing EGs.  I do this by taking the IPIs with no EGs,
> and using biomaRt with UniGene, UniProtKB etc as inputs to getBM(),
> and hope I get a few more EGs.
>
> For example:
>
> library(biomaRt)
> mart <- useMart( "ensembl", dataset="hsapiens_gene_ensembl")
> length(which(is.na(dat.all[,4])))
> sum(z <- !is.na(dat.all[,4]))
> w <-
> getBM(attributes=c("entrezgene","unigene","hgnc_symbol","description"),filters="unigene",values=dat.all[z,4],
> mart=mart)
>
> By doing several of these getBM() steps, I add 37 more EGs!
>
> My method is long and painful.  That merge approach is clean and
> beautiful.
>
> Is there a way to add to the merge argument or something that would
> give me the additional 100K+ IPIs and 4500+ EGs?
>
> ------------------------------
> Message: 20
> Date: Fri, 18 Feb 2011 13:17:18 -0800
> From: "Carlson, Marc R" <mcarlson at fhcrc.org>
> To: <bioconductor at stat.math.ethz.ch>
> Subject: Re: [BioC] IPI to entrez id
> Message-ID:
> <1688456294.5987.1298063838120.JavaMail.root at zimbra4.fhcrc.org>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Viritha,
>
> These things can never be 1:1, but you can pretty easily just cram
> them all into a huge data.frame by doing this:
>
> library(org.Hs.eg.db)
> allAnnots <- merge(toTable(org.Hs.egPROSITE), toTable(org.Hs.egGO),
> by.x="gene_id", by.y="gene_id")
> head(allAnnots)
>
> Once you have done this, you may notice not only that these things are
> almost never (if ever) 1:1, but also that this could have been even
> worse if I had used the GO2ALL mappings (and I probably should have,
> but I can't really tell because I have almost no information about
> what you want to do).  Anyhow, I hope this helps you, but if you have
> a more specific use for this information that you are willing to talk
> about, then we might be able to give you a better answer.
>
>
> Marc
> ------------------------------
>
> Thanks very much,
> Dick
> *******************************************************************************
>
> Richard P. Beyer, Ph.D.    University of Washington
> Tel.:(206) 616 7378    Env. & Occ. Health Sci. , Box 354695
> Fax: (206) 685 4696    4225 Roosevelt Way NE, # 100
> Seattle, WA 98105-6099
> http://depts.washington.edu/ceeh/members_fc_bioinfo.html
> http://staff.washington.edu/~dbeyer
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor