[BioC] IPI to entrez id

Thu Mar 10 20:16:41 CET 2011

Hi Marc,

Thanks for the good explanation.  I do want the deprecated IPIs (or should I get the proteomics folks I work with to update things at their end more frequently?).

Maybe it would be cleaner to just massage the IPIs I get so as to change deprecated IPIs to current ones, then use the merged GO and PROSITE table.  I'll have to play around with that and see.  I fuss around a lot trying to get every possible Entrez Gene ID for the IPIs I deal with.  So when I get Entrez Gene IDs for a subset of a list of IPIs, I go through several extra steps (biomaRt etc) to try and find Entrez Gene IDs for the IPIs that are missing them.
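
A rough sketch of what that massaging step might look like, using a few made-up flat-file lines rather than the real ipi.HUMAN.dat.  The accessions are hypothetical, and the sketch assumes that the first accession in each record's AC line is the current primary one and the rest are secondary:

```r
# Toy stand-in for lines read from an IPI flat file (hypothetical accessions).
flat <- c(
  "ID   IPI00000001.1         IPI;      PRT;   100 AA.",
  "AC   IPI00000001; IPI00000900; IPI00000901;",
  "DR   Entrez Gene; 1234; GENEA.",
  "//",
  "ID   IPI00000002.3         IPI;      PRT;   200 AA.",
  "AC   IPI00000002;",
  "DR   Entrez Gene; 5678; GENEB.",
  "//"
)

# Give every line a record number; a record ends at a "//" line.
rec <- cumsum(c(TRUE, head(flat == "//", -1)))

# Pull every accession out of the AC lines.
ac_lines <- grep("^AC", flat)
acs <- regmatches(flat[ac_lines], gregexpr("IPI[0-9]+", flat[ac_lines]))

# One row per accession, tagged with the record it came from.
ac_tab <- data.frame(
  rec       = rep(rec[ac_lines], lengths(acs)),
  accession = unlist(acs),
  stringsAsFactors = FALSE
)

# Assumption: the first accession seen in a record is the primary one.
first <- !duplicated(ac_tab$rec)
primary <- setNames(ac_tab$accession[first], ac_tab$rec[first])
ac_tab$primary <- unname(primary[as.character(ac_tab$rec)])

# Lookup vector: any accession (primary or secondary) -> current primary.
to_current <- setNames(ac_tab$primary, ac_tab$accession)

# IPIs as they might arrive from the proteomics side (hypothetical list).
my_ipis <- c("IPI00000900", "IPI00000002", "IPI00009999")
current <- unname(to_current[my_ipis])
current  # unknown IPIs come back as NA
```

Once the IPIs are converted this way, they could be matched against the IPIs in the org.Hs.eg.db-based table, and the ones that come back as NA are the ones that would still need the extra biomaRt steps.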

Thanks very much,
Dick
*******************************************************************************
Richard P. Beyer, Ph.D.	University of Washington
Tel.:(206) 616 7378	Env. & Occ. Health Sci. , Box 354695
Fax: (206) 685 4696	4225 Roosevelt Way NE, # 100
 			Seattle, WA 98105-6099
http://depts.washington.edu/ceeh/members_fc_bioinfo.html
http://staff.washington.edu/~dbeyer
*******************************************************************************

On Thu, 10 Mar 2011, Marc Carlson wrote:

> Hi Dick,
>
> So let's look at that example that you gave in your last post where you
> merged the GO table with the PROSITE one.  It is important to understand
> that when you call merge(), it does its magic by basically performing
> (by default) an inner join on the two tables passed as its first two
> arguments.  Therefore the mechanics of how that merge will do its job
> mean that you have effectively restricted the results to only those
> entrez genes where you have BOTH a GO annotation AND a PROSITE annotation.
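
To make the inner-join point concrete, here is a minimal sketch with two toy tables.  The gene IDs and annotation IDs are made up for illustration and do not come from org.Hs.eg.db; only merge() and its all / all.x arguments are real base R:

```r
# Two toy annotation tables sharing a gene_id column (hypothetical values).
prosite <- data.frame(gene_id    = c("1", "2", "3"),
                      prosite_id = c("PS00001", "PS00002", "PS00003"),
                      stringsAsFactors = FALSE)
go <- data.frame(gene_id = c("2", "3", "4"),
                 go_id   = c("GO:0000001", "GO:0000002", "GO:0000003"),
                 stringsAsFactors = FALSE)

# Default merge (all = FALSE): inner join, keeps only genes found in BOTH tables.
inner <- merge(prosite, go, by = "gene_id")

# all = TRUE: full outer join, keeps genes found in EITHER table (gaps become NA).
outer <- merge(prosite, go, by = "gene_id", all = TRUE)

# all.x = TRUE: left join, keeps every gene of the first table.
left <- merge(prosite, go, by = "gene_id", all.x = TRUE)

sort(unique(inner$gene_id))  # genes 2 and 3 only
sort(unique(outer$gene_id))  # genes 1 through 4
sort(unique(left$gene_id))   # genes 1 through 3
```

So if the goal is to keep every Entrez Gene ID that has at least one of the two annotations, all = TRUE (or all.x / all.y) is the argument to look at.
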
>
> So the result you see (more EGs from the bigger join) would happen if
> for example you had increased the size of the IPI tables (by pairing up
> some deprecated IPI ids with some legitimate entrez gene IDs).  In this
> situation, these entrez gene IDs would be perfectly legitimate, but
> their IPI IDs would all be older deprecated IPI ids.  I am not sure if
> that is what you really want or not, but if it is, then the final table
> indeed would be bigger.
>
>
> Hope that clarifies things,
>
>
>  Marc
>
>
>
>
>
> On 03/10/2011 07:16 AM, Dick Beyer wrote:
>> Hi Marc,
>>
>> Thanks very much for the merge example.  That's so much cleaner than my usual approach.
>>
>> As far as the differing numbers of IPIs, I agree that using the AC field gives a lot more IPIs than just using the ID field.  I guess I'm trying to solve a different problem than other folks.  I get these lists of IPIs from proteomics folks, and some of the IPIs wouldn't show up in the ID field of "ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.dat.gz", but are present in the AC field because they are no longer the current primary identifier (http://www.ebi.ac.uk/IPI/Algorithm.html#SECONDARIES).  What I am really after is the Entrez Gene ID, so for me, the IPI is just a pointer to the EG.
>>
>> However, with this approach:
>>
>> library(org.Hs.eg.db)
>> allAnnots <- merge(toTable(org.Hs.egPROSITE), toTable(org.Hs.egGO), by.x="gene_id", by.y="gene_id")
>>
>> I get several thousand fewer EGs than using either ipi.HUMAN.dat or ipi.HUMAN.xrefs.  I looked at a few of the EGs that I get out of these that are not in org.Hs.eg.db, and they seem valid.  I'll do some more checking to be sure.  In ipi.HUMAN.xrefs I get 22113 unique EGs and 86719 unique IPIs from a download 2 days ago from "ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.xrefs.gz"
>>
>> What I had hoped I was creating was a table with any and all IPIs (primary and secondary) but each with valid EGs.
>>
>> Thanks very much for your great help,
>> Dick
>> ------------------------------
>>
>> Message: 24
>> Date: Wed, 09 Mar 2011 17:25:34 -0800
>> From: Marc Carlson <mcarlson at fhcrc.org>
>> To: bioconductor at r-project.org
>> Subject: Re: [BioC] IPI to entrez id
>> Message-ID: <4D78288E.6060904 at fhcrc.org>
>> Content-Type: text/plain; charset=ISO-8859-1
>>
>> Hi Dick,
>>
>> Is there any reason why something like this won't work for you to attach
>> the GO Ids?
>>
>> merge(dat.all, toTable(org.Hs.egGO), by.x="EG", by.y="gene_id")
>>
>>
>>
>> When I build the org.Hs.eg.db package, I download the IPI Ids in the
>> mySQL database from the following source:
>>
>> ftp://ftp.ebi.ac.uk/pub/databases/IPI/current
>>
>>
>> The most recent time I did this (last week) resulted in an IPI-to-gene
>> table that contains only 86719 unique IPIs.  And that number drops a bit
>> when I fold the IPI IDs into the org database (which takes as its set of
>> unique entrez gene IDs the ones that are currently listed at NCBI) to
>> 77387 distinct IPI IDs.  And if you merge with a GO table, I expect that
>> it will drop a bit more.
>>
>> In looking at the file that you are parsing, I get exactly the same
>> number of unique IPI ids if I extract from the ID fields and match to
>> the Entrez Gene fields (which also gives me the exact same number of
>> entrez gene IDs).  I can only get the huge additional number of IPI Ids
>> from this file if I also mine the AC field and assume that these IPI Ids
>> also should map to the exact same things.  But the direct database dump
>> from EBI does not give me these mappings.  In fact, it does not seem to
>> even contain them.  This causes me to be concerned that maybe these IDs
>> may not be what you think they are?
>>
>>
>> Anyhow, I hope this helps,
>>
>>
>>   Marc
>>
>>
>>
>>
>> On 03/08/2011 11:39 AM, Dick Beyer wrote:
>> Hi to all,
>>
>> For several years now, I have been doing GO analysis on lists of
>> proteins derived from MS.  I am given IPIs by the proteomics folks and
>> need the corresponding Entrez Gene IDs.  Putting aside the issues of
>> non-unique mapping from IPI to EG, isoforms, etc., I was wondering if
>> anyone would comment on my method of getting the Entrez Gene IDs. I'd
>> really like to use Marc Carlson's merge method (shown below), but that
>> approach seems to miss several thousand IPI/EG matches that my method
>> finds.
>>
>> I start with
>> ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.dat.gz, and
>> extract a subset of the rows:
>>
>> ipiHUMAN <- readLines(con = "ipi.HUMAN.3.80.dat",n=-1) # build 11feb2011
>> dbfetch.all <- ipiHUMAN
>> rm(ipiHUMAN)
>>
>> # Explanation of the data format is found here
>> # http://www.ebi.ac.uk/2can/tutorials/formats.html#swiss
>>
>> length(dbfetch.all) #  3180244
>> length(eg  <- grep("^DR   Entrez Gene",    dbfetch.all)) # 80296
>> length(ids <- grep("^ID",                  dbfetch.all)) # 86719
>> length(de  <- grep("^DE",                  dbfetch.all)) # 92454
>> length(ac  <- grep("^AC",                  dbfetch.all)) # 93720
>> length(ug  <- grep("^DR   UniGene",        dbfetch.all)) # 88314
>> length(up  <- grep("^DR   UniProtKB",      dbfetch.all)) # 110593
>> length(en  <- grep("^DR   ENSEMBL",        dbfetch.all)) # 77340
>> length(rs  <- grep("^DR   REFSEQ_REVIEWED",dbfetch.all)) # 14559
>>
>> and eventually turn this into a data.frame with the columns:
>>
>> "IPI","EG","GeneSymbol","UniGene","UniProtKB","ENSEMBL","REFSEQ_REVIEWED"
>>
>> (Note: Not every IPI entry has every field)
>>
>> For this build of the IPI file, my data.frame ends up as
>> dim(dat.all)
>> [1] 183153      7
>>
>> Of these 183153 IPIs, there are 171909 unique IPIs, and 22342 unique
>> Entrez Gene IDs.
>>
>> The merge method shown below from Marc Carlson gives 69315 unique IPIs
>> and 17783 unique Entrez Gene IDs (you get the same numbers whether you
>> use org.Hs.egGO2ALLEGS or org.Hs.egGO).
>>
>> When I build my 7 column data.frame, I initially get 22305 unique
>> Entrez Gene IDs, and I then go through some additional steps of trying
>> to fill in the missing EGs.  I do this by taking the IPIs with no EGs,
>> and using biomaRt with UniGene, UniProtKB etc as inputs to getBM(),
>> and hope I get a few more EGs.
>>
>> For example:
>>
>> library(biomaRt)
>> mart <- useMart( "ensembl", dataset="hsapiens_gene_ensembl")
>> length(which(is.na(dat.all[,4])))
>> sum(z <- !is.na(dat.all[,4]))
>> w <- getBM(attributes=c("entrezgene","unigene","hgnc_symbol","description"),
>>            filters="unigene", values=dat.all[z,4], mart=mart)
>>
>> By doing several of these getBM() steps, I add 37 more EGs!
>>
>> My method is long and painful.  That merge approach is clean and
>> beautiful.
>>
>> Is there a way to add to the merge argument or something that would
>> give me the additional 100K+ IPIs and 4500+ EGs?
>>
>> ------------------------------
>> Message: 20
>> Date: Fri, 18 Feb 2011 13:17:18 -0800
>> From: "Carlson, Marc R" <mcarlson at fhcrc.org>
>> To: <bioconductor at stat.math.ethz.ch>
>> Subject: Re: [BioC] IPI to entrez id
>> Message-ID:
>> <1688456294.5987.1298063838120.JavaMail.root at zimbra4.fhcrc.org>
>> Content-Type: text/plain; charset="utf-8"
>>
>> Hi Viritha,
>>
>> These things can never be 1:1, but you can pretty easily just cram
>> them all into a huge data.frame by doing this:
>>
>> library(org.Hs.eg.db)
>> allAnnots <- merge(toTable(org.Hs.egPROSITE), toTable(org.Hs.egGO),
>> by.x="gene_id", by.y="gene_id")
>> head(allAnnots)
>>
>> Once you have done this, you may notice not only that these things
>> are almost never (if ever) 1:1, but that this could have been
>> even worse if I had used the GO2ALL mappings (and I probably should
>> have, but I can't really tell because I have almost no information
>> about what you want to do).  Anyhow, I hope this helps you, but if you
>> have a more specific use for this information that you are willing to
>> talk about then we might be able to give you a better answer.
>>
>>
>> Marc
>> ------------------------------
>>
>> Thanks very much,
>> Dick
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>