[BioC] IPI to entrez id
Dick Beyer
dbeyer at u.washington.edu
Thu Mar 10 20:16:41 CET 2011
Hi Marc,
Thanks for the good explanation. I do want the deprecated IPIs (or perhaps I should get the proteomics folks I work with to update things at their end more frequently?).
Maybe it would be cleaner to just massage the IPIs I get so as to change deprecated IPIs to current ones, then use the merged GO and PROSITE table. I'll have to play around with that and see. I fuss around a lot trying to get every possible Entrez Gene ID for the IPIs I deal with. So when I get Entrez Gene IDs for a subset of a list of IPIs, I go through several extra steps (biomaRt etc) to try and find Entrez Gene IDs for the IPIs that are missing them.
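
A minimal sketch of that idea, using only base R and made-up toy tables. Every identifier, column name and table here (sec2primary, ipi2eg, query) is hypothetical and stands in for whatever would really be parsed from the IPI files or pulled from the annotation package. It shows three steps: translate deprecated IPIs to their current primary accession, left-join to Entrez Gene IDs so nothing is silently dropped, and list the IPIs that still need a fallback lookup (biomaRt or similar).

```r
# Toy data only: every identifier and table below is invented for illustration.

# Hypothetical lookup from secondary (deprecated) IPI accessions to the
# current primary accession, e.g. as could be parsed from the AC lines.
sec2primary <- data.frame(
  secondary = c("IPI00000002", "IPI00000003"),
  primary   = c("IPI00000001", "IPI00000004"),
  stringsAsFactors = FALSE
)

# Hypothetical table of current IPI -> Entrez Gene ID pairs.
ipi2eg <- data.frame(
  ipi_id  = c("IPI00000001", "IPI00000005"),
  gene_id = c("1001", "1005"),
  stringsAsFactors = FALSE
)

# IPIs as received: one current, two deprecated, one unknown to both tables.
query <- c("IPI00000001", "IPI00000002", "IPI00000003", "IPI00000009")

# Step 1: replace deprecated accessions by their current primary accession;
# anything not found in the lookup is passed through unchanged.
idx     <- match(query, sec2primary$secondary)
current <- ifelse(is.na(idx), query, sec2primary$primary[idx])

# Step 2: left join (all.x = TRUE) so that IPIs without an Entrez Gene ID
# are kept with gene_id = NA instead of being dropped by the inner join.
res <- merge(
  data.frame(ipi_in = query, ipi_id = current, stringsAsFactors = FALSE),
  ipi2eg,
  by = "ipi_id", all.x = TRUE
)

# Step 3: the original IPIs that still lack an Entrez Gene ID and would
# need the extra lookup steps.
needs_lookup <- res$ipi_in[is.na(res$gene_id)]

print(res)
print(needs_lookup)
```

Keeping the original accession in its own column (ipi_in) means the final table can still be reported against the IDs the proteomics group supplied, while the join itself only ever sees current accessions.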
Thanks very much,
Dick
*******************************************************************************
Richard P. Beyer, Ph.D. University of Washington
Tel.:(206) 616 7378 Env. & Occ. Health Sci. , Box 354695
Fax: (206) 685 4696 4225 Roosevelt Way NE, # 100
Seattle, WA 98105-6099
http://depts.washington.edu/ceeh/members_fc_bioinfo.html
http://staff.washington.edu/~dbeyer
*******************************************************************************
On Thu, 10 Mar 2011, Marc Carlson wrote:
> Hi Dick,
>
> So let's look at that example that you gave in your last post where you
> merged the GO table with the PROSITE one. It is important to understand
> that when you call merge(), it does its magic by basically performing
> an inner join on the two tables that comprise its first two
> arguments. Therefore the mechanics of how that merge will do its job
> mean that you have effectively restricted the results to only those
> entrez genes where you have BOTH a GO annotation AND a PROSITE annotation.
>
> So the result you see (more EGs from the bigger join) would happen if
> for example you had increased the size of the IPI tables (by pairing up
> some deprecated IPI ids with some legitimate entrez gene IDs). In this
> situation, these entrez gene IDs would be perfectly legitimate, but
> their IPI IDs would all be older deprecated IPI ids. I am not sure if
> that is what you really want or not, but if it is, then the final table
> indeed would be bigger.
>
>
> Hope that clarifies things,
>
>
> Marc
>
>
>
>
>
> On 03/10/2011 07:16 AM, Dick Beyer wrote:
>> Hi Marc,
>>
>> Thanks very much for the merge example. That's so much cleaner than my usual approach.
>>
>> As far as the differing numbers of IPIs, I agree that using the AC field gives a lot more IPIs than just using the ID field. I guess I'm trying to solve a different problem than other folks. I get these lists of IPIs from proteomics folks, and some of the IPIs wouldn't show up in the ID field of "ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.dat.gz", but are present in the AC field because they are no longer the current primary identifier (http://www.ebi.ac.uk/IPI/Algorithm.html#SECONDARIES). What I am really after is the Entrez Gene ID, so for me, the IPI is just a pointer to the EG.
>>
>> However, with this approach:
>>
>> library(org.Hs.eg.db)
>> allAnnots <- merge(toTable(org.Hs.egPROSITE), toTable(org.Hs.egGO), by.x="gene_id", by.y="gene_id")
>>
>> I get several thousand fewer EGs than using either ipi.HUMAN.dat or ipi.HUMAN.xrefs. I looked at a few of the EGs that I get out of these that are not in org.Hs.eg.db, and they seem valid. I'll do some more checking to be sure. In ipi.HUMAN.xrefs I get 22113 unique EGs and 86719 unique IPIs from a download 2 days ago from "ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.xrefs.gz"
>>
>> What I had hoped I was creating was a table with any and all IPIs (primary and secondary) but each with valid EGs.
>>
>> Thanks very much for your great help,
>> Dick
>> ------------------------------
>>
>> Message: 24
>> Date: Wed, 09 Mar 2011 17:25:34 -0800
>> From: Marc Carlson <mcarlson at fhcrc.org>
>> To: bioconductor at r-project.org
>> Subject: Re: [BioC] IPI to entrez id
>> Message-ID: <4D78288E.6060904 at fhcrc.org>
>> Content-Type: text/plain; charset=ISO-8859-1
>>
>> Hi Dick,
>>
>> Is there any reason why something like this won't work for you to attach
>> the GO Ids?
>>
>> merge(dat.all, toTable(org.Hs.egGO), by.x="EG", by.y="gene_id")
>>
>>
>>
>> When I build the org.Hs.eg.db package, I download the IPI Ids in the
>> mySQL database from the following source:
>>
>> ftp://ftp.ebi.ac.uk/pub/databases/IPI/current
>>
>>
>> The most recent time I did this (last week) results in an ipi to gene
>> table that contains only 86719 unique IPIs. And that number drops a bit
>> when I fold the IPI IDs into the org database (which takes as its set of
>> unique entrez gene IDs the ones that are currently listed at NCBI) to
>> 77387 distinct IPI IDs. And if you merge with a GO table, I expect that
>> it will drop a bit more.
>>
>> In looking at the file that you are parsing, I get exactly the same
>> number of unique IPI ids if I extract from the ID fields and match to
>> the Entrez Gene fields (which also gives me the exact same number of
>> entrez gene IDs). I can only get the huge additional number of IPI Ids
>> from this file if I also mine the AC field and assume that these IPI Ids
>> also should map to the exact same things. But the direct database dump
>> from EBI does not give me these mappings. In fact, it does not seem to
>> even contain them. This causes me to be concerned that maybe these IDs
>> may not be what you think they are?
>>
>>
>> Anyhow, I hope this helps,
>>
>>
>> Marc
>>
>>
>>
>>
>> On 03/08/2011 11:39 AM, Dick Beyer wrote:
>> Hi to all,
>>
>> For several years now, I have been doing GO analysis on lists of
>> proteins derived from MS. I am given IPIs by the proteomics folks and
>> need the corresponding Entrez Gene IDs. Putting aside the issues of
>> non-unique mapping from IPI to EG, isoforms, etc., I was wondering if
>> anyone would comment on my method of getting the Entrez Gene IDs. I'd
>> really like to use Marc Carlson's merge method (shown below), but that
>> approach seems to miss several thousand IPI/EG matches that my method
>> finds.
>>
>> I start with
>> ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.dat.gz, and
>> extract a subset of the rows:
>>
>> ipiHUMAN <- readLines(con = "ipi.HUMAN.3.80.dat",n=-1) # build 11feb2011
>> dbfetch.all <- ipiHUMAN
>> rm(ipiHUMAN)
>>
>> # Explanation of the data format is found here
>> # http://www.ebi.ac.uk/2can/tutorials/formats.html#swiss
>>
>> length(dbfetch.all) # 3180244
>> length(eg <- grep("^DR Entrez Gene", dbfetch.all)) # 80296
>> length(ids <- grep("^ID", dbfetch.all)) # 86719
>> length(de <- grep("^DE", dbfetch.all)) # 92454
>> length(ac <- grep("^AC", dbfetch.all)) # 93720
>> length(ug <- grep("^DR UniGene", dbfetch.all)) # 88314
>> length(up <- grep("^DR UniProtKB", dbfetch.all)) # 110593
>> length(en <- grep("^DR ENSEMBL", dbfetch.all)) # 77340
>> length(rs <- grep("^DR REFSEQ_REVIEWED",dbfetch.all)) # 14559
>>
>> and eventually turn this into a data.frame with the columns:
>>
>> "IPI","EG","GeneSymbol","UniGene","UniProtKB","ENSEMBL","REFSEQ_REVIEWED"
>>
>> (Note: Not every IPI entry has every field)
>>
>> For this build of the IPI file, my data.frame ends up as
>> dim(dat.all)
>> [1] 183153 7
>>
>> Of these 183153 rows, there are 171909 unique IPIs, and 22342 unique
>> Entrez Gene IDs.
>>
>> The merge method shown below from Marc Carlson gives 69315 unique IPIs
>> and 17783 unique Entrez Gene IDs (you get the same numbers whether you
>> use org.Hs.egGO2ALLEGS or org.Hs.egGO).
>>
>> When I build my 7 column data.frame, I initially get 22305 unique
>> Entrez Gene IDs, and I then go through some additional steps of trying
>> to fill in the missing EGs. I do this by taking the IPIs with no EGs,
>> and using biomaRt with UniGene, UniProtKB etc as inputs to getBM(),
>> and hope I get a few more EGs.
>>
>> For example:
>>
>> library(biomaRt)
>> mart <- useMart( "ensembl", dataset="hsapiens_gene_ensembl")
>> length(which(is.na(dat.all[,4])))
>> sum(z <- !is.na(dat.all[,4]))
>> w <- getBM(attributes=c("entrezgene","unigene","hgnc_symbol","description"),
>>            filters="unigene", values=dat.all[z,4], mart=mart)
>>
>> By doing several of these getBM() steps, I add 37 more EGs!
>>
>> My method is long and painful. That merge approach is clean and
>> beautiful.
>>
>> Is there a way to add to the merge argument or something that would
>> give me the additional 100K+ IPIs and 4500+ EGs?
>>
>> ------------------------------
>> Message: 20
>> Date: Fri, 18 Feb 2011 13:17:18 -0800
>> From: "Carlson, Marc R" <mcarlson at fhcrc.org>
>> To: <bioconductor at stat.math.ethz.ch>
>> Subject: Re: [BioC] IPI to entrez id
>> Message-ID:
>> <1688456294.5987.1298063838120.JavaMail.root at zimbra4.fhcrc.org>
>> Content-Type: text/plain; charset="utf-8"
>>
>> Hi Viritha,
>>
>> These things can never be 1:1, but you can pretty easily just cram
>> them all into a huge data.frame by doing this:
>>
>> library(org.Hs.eg.db)
>> allAnnots <- merge(toTable(org.Hs.egPROSITE), toTable(org.Hs.egGO),
>> by.x="gene_id", by.y="gene_id")
>> head(allAnnots)
>>
>> Once you have done this, you may notice that not only are
>> these things almost never (if ever) 1:1, but that this could have been
>> even worse if I had used the GO2ALL mappings (and I probably should
>> have, but I can't really tell because I have almost no information
>> about what you want to do). Anyhow, I hope this helps you, but if you
>> have a more specific use for this information that you are willing to
>> talk about then we might be able to give you a better answer.
>>
>>
>> Marc
>> ------------------------------
>>
>> Thanks very much,
>> Dick
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>