[BioC] IPI to entrez id
Marc Carlson
mcarlson at fhcrc.org
Thu Mar 10 02:25:34 CET 2011
Hi Dick,
Is there any reason why something like this won't work for you to attach
the GO Ids?
merge(dat.all, toTable(org.Hs.egGO), by.x="EG", by.y="gene_id")
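As a rough sketch of what that merge does, here is the same call on toy data. The values below are made up for illustration; go.tab is only a stand-in for toTable(org.Hs.egGO), with the column names that call returns, and dat.all is cut down to its IPI and EG columns.

```r
# Toy stand-in for dat.all (hypothetical values; only IPI and EG shown).
dat.all <- data.frame(
  IPI = c("IPI00000001", "IPI00000002", "IPI00000003"),
  EG  = c("10", "20", NA),
  stringsAsFactors = FALSE
)

# Toy stand-in for toTable(org.Hs.egGO) (hypothetical rows).
go.tab <- data.frame(
  gene_id  = c("10", "10", "30"),
  go_id    = c("GO:0000001", "GO:0000002", "GO:0000003"),
  Evidence = c("IEA", "TAS", "IEA"),
  Ontology = c("BP", "MF", "CC"),
  stringsAsFactors = FALSE
)

# Inner join: only rows whose EG appears in the GO table survive,
# and a gene with several GO terms yields several rows.
merged <- merge(dat.all, go.tab, by.x = "EG", by.y = "gene_id")

# all.x = TRUE keeps IPIs that have no GO annotation (GO columns become NA).
merged.all <- merge(dat.all, go.tab, by.x = "EG", by.y = "gene_id",
                    all.x = TRUE)
```

The default inner join is the reason the merged table can hold fewer unique IPIs than dat.all; all.x = TRUE is one way to see which rows would otherwise be dropped.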
When I build the org.Hs.eg.db package, I download the IPI Ids in the
mySQL database from the following source:
ftp://ftp.ebi.ac.uk/pub/databases/IPI/current
The most recent time I did this (last week), the result was an IPI-to-gene
table containing only 86719 unique IPIs. That number drops a bit, to
77387 distinct IPI IDs, when I fold the IPI IDs into the org database
(which takes as its set of unique Entrez Gene IDs the ones that are
currently listed at NCBI). And if you merge with a GO table, I expect that
it will drop a bit more.
In looking at the file that you are parsing, I get exactly the same
number of unique IPI ids if I extract from the ID fields and match to
the Entrez Gene fields (which also gives me the exact same number of
entrez gene IDs). I can only get the huge additional number of IPI Ids
from this file if I also mine the AC field and assume that these IPI Ids
also should map to the exact same things. But the direct database dump
from EBI does not give me these mappings. In fact, it does not even seem to
contain them. This makes me concerned that these IDs may not be what you
think they are.
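To make the ID-versus-AC distinction concrete, here is a minimal sketch on made-up records. The record layout is a simplified assumption modelled on the Swiss-Prot-style format, not a copy of the real file, and all identifiers are invented. It separates the identifier on each ID line from the accessions listed on the AC lines, then reports the ones that appear only in AC fields.

```r
# Made-up records in a simplified IPI-like layout (an assumption, not real data).
lines <- c(
  "ID   IPI00000001.2   IPI;   PRT;   100 AA.",
  "AC   IPI00000001; IPI00900001; IPI00900002;",
  "DR   Entrez Gene; 111; GENEA; -.",
  "//",
  "ID   IPI00000002.1   IPI;   PRT;   200 AA.",
  "AC   IPI00000002;",
  "DR   Entrez Gene; 222; GENEB; -.",
  "//"
)

# Assign each line to a record: the counter increases at every ID line.
rec <- cumsum(grepl("^ID", lines))

# Primary identifier from each ID line, with the version suffix removed.
id.lines <- grep("^ID", lines, value = TRUE)
primary <- sub("\\..*$", "", sub("^ID\\s+(\\S+).*$", "\\1", id.lines))

# Every accession listed on the AC lines, tagged with its record number.
ac.idx <- grep("^AC", lines)
ac.list <- strsplit(sub("^AC\\s+", "", lines[ac.idx]), ";\\s*")
ac <- data.frame(rec = rep(rec[ac.idx], lengths(ac.list)),
                 IPI = unlist(ac.list), stringsAsFactors = FALSE)

# Accessions that occur only in AC fields, never as a record's own ID.
extra <- setdiff(ac$IPI, primary)
```

On the real file, the length of something like extra would account for the additional IPIs; whether those accessions should inherit the Entrez Gene mapping of their record is exactly the open question.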
Anyhow, I hope this helps,
Marc
On 03/08/2011 11:39 AM, Dick Beyer wrote:
> Hi to all,
>
> For several years now, I have been doing GO analysis on lists of
> proteins derived from MS. I am given IPIs by the proteomics folks and
> need the corresponding Entrez Gene IDs. Putting aside the issues of
> non-unique mapping from IPI to EG, isoforms, etc., I was wondering if
> anyone would comment on my method of getting the Entrez Gene IDs. I'd
> really like to use Marc Carlson's merge method (shown below), but that
> approach seems to miss several thousand IPI/EG matches that my method
> finds.
>
> I start with
> ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.dat.gz, and
> extract a subset of the rows:
>
> ipiHUMAN <- readLines(con = "ipi.HUMAN.3.80.dat",n=-1) # build 11feb2011
> dbfetch.all <- ipiHUMAN
> rm(ipiHUMAN)
>
> # Explanation of the data format is found here
> # http://www.ebi.ac.uk/2can/tutorials/formats.html#swiss
>
> length(dbfetch.all) # 3180244
> length(eg <- grep("^DR Entrez Gene", dbfetch.all)) # 80296
> length(ids <- grep("^ID", dbfetch.all)) # 86719
> length(de <- grep("^DE", dbfetch.all)) # 92454
> length(ac <- grep("^AC", dbfetch.all)) # 93720
> length(ug <- grep("^DR UniGene", dbfetch.all)) # 88314
> length(up <- grep("^DR UniProtKB", dbfetch.all)) # 110593
> length(en <- grep("^DR ENSEMBL", dbfetch.all)) # 77340
> length(rs <- grep("^DR REFSEQ_REVIEWED",dbfetch.all)) # 14559
>
> and eventually turn this into a data.frame with the columns:
>
> "IPI","EG","GeneSymbol","UniGene","UniProtKB","ENSEMBL","REFSEQ_REVIEWED"
>
> (Note: Not every IPI entry has every field)
>
> For this build of the IPI file, my data.frame ends up as
> dim(dat.all)
> [1] 183153 7
>
> Of these 183153 IPIs, there are 171909 unique IPIs, and 22342 unique
> Entrez Gene IDs.
>
> The merge method shown below from Marc Carlson gives 69315 unique IPIs
> and 17783 unique Entrez Gene IDs (you get the same numbers whether you
> use org.Hs.egGO2ALLEGS or org.Hs.egGO).
>
> When I build my 7 column data.frame, I initially get 22305 unique
> Entrez Gene IDs, and I then go through some additional steps of trying
> to fill in the missing EGs. I do this by taking the IPIs with no EGs,
> and using biomaRt with UniGene, UniProtKB etc as inputs to getBM(),
> and hope I get a few more EGs.
>
> For example:
>
> library(biomaRt)
> mart <- useMart( "ensembl", dataset="hsapiens_gene_ensembl")
> length(which(is.na(dat.all[,4])))
> sum(z <- !is.na(dat.all[,4]))
> w <- getBM(attributes=c("entrezgene","unigene","hgnc_symbol","description"),
>            filters="unigene", values=dat.all[z,4], mart=mart)
>
> By doing several of these getBM() steps, I add 37 more EGs!
>
> My method is long and painful. That merge approach is clean and
> beautiful.
>
> Is there a way to add to the merge argument or something that would
> give me the additional 100K+ IPIs and 4500+ EGs?
>
> ------------------------------
> Message: 20
> Date: Fri, 18 Feb 2011 13:17:18 -0800
> From: "Carlson, Marc R" <mcarlson at fhcrc.org>
> To: <bioconductor at stat.math.ethz.ch>
> Subject: Re: [BioC] IPI to entrez id
> Message-ID:
> <1688456294.5987.1298063838120.JavaMail.root at zimbra4.fhcrc.org>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Viritha,
>
> These things can never be 1:1, but you can pretty easily just cram
> them all into a huge data.frame by doing this:
>
> library(org.Hs.eg.db)
> allAnnots <- merge(toTable(org.Hs.egPROSITE), toTable(org.Hs.egGO),
> by.x="gene_id", by.y="gene_id")
> head(allAnnots)
>
> Once you have done this, you may notice not only that these things
> are almost never (if ever) 1:1, but also that this could have been
> even worse if I had used the GO2ALL mappings (and I probably should
> have, but I can't really tell because I have almost no information
> about what you want to do). Anyhow, I hope this helps you, but if you
> have a more specific use for this information that you are willing to
> talk about then we might be able to give you a better answer.
>
>
> Marc
> ------------------------------
>
> Thanks very much,
> Dick
> *******************************************************************************
>
> Richard P. Beyer, Ph.D. University of Washington
> Tel.:(206) 616 7378 Env. & Occ. Health Sci. , Box 354695
> Fax: (206) 685 4696 4225 Roosevelt Way NE, # 100
> Seattle, WA 98105-6099
> http://depts.washington.edu/ceeh/members_fc_bioinfo.html
> http://staff.washington.edu/~dbeyer
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor