[BioC] from RefSeq to GO terms / gene symbol to geneID
Simon Lin
simonlin at duke.edu
Tue Jun 12 22:39:27 CEST 2007
In the following two unrelated messages, both Sean and Nianhua suggested
downloading and parsing data tables from the NCBI. The gene_info table and
several others seem very useful. If that is the case, why not
have them preloaded into an SQLite database and distributed as part of the
annotation package for human?
Simon
=================
Date: Tue, 12 Jun 2007 05:59:55 -0400
From: Sean Davis <sdavis2 at mail.nih.gov>
Subject: Re: [BioC] from RefSeq GI protein identifiers to GO terms
To: Lina Hultin-Rosenberg <lina.hultin-rosenberg at ki.se>
Cc: bioconductor at stat.math.ethz.ch
Message-ID: <466E6E9B.3020609 at mail.nih.gov>
Content-Type: text/plain; charset=ISO-8859-1
Lina Hultin-Rosenberg wrote:
>> Dear list,
>>
>> This might be a question that has been discussed previously but I could not
>> find any good solution for it. I have lists of human proteins from various
>> proteomics studies that I want to compare with regard to the GO terms
>> associated with them. I have the RefSeq GI protein ID for the proteins and my
>> question is how best to map those to other identifiers that I can use in
>> subsequent GO analysis?
>>
>> It might be that this problem is solved best outside R but maybe someone
>> still can give me a hint to the best solution. For me this is a problem that
>> comes up quite often - the need to map between different identifiers - and I
>> have not yet found any really good solution to it. If I for example use IPI I
>> always lose some proteins/genes since the coverage is rather bad, but maybe
>> there is no solution that will give perfect mapping?!
>
>
The file located here:
ftp://ftp.ncbi.nih.gov/gene/DATA/gene2refseq.gz
and described in detail here:
ftp://ftp.ncbi.nih.gov/gene/DATA/README
maps RefSeq identifiers to Entrez Gene IDs. Once you have the Entrez Gene
ID, you can use the Bioconductor annotation packages to get GO mappings.
The file above is a tab-delimited text file, so you should be able to read
it into R and do the matching by GI number rather easily.
Hope that helps.
Sean
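The lookup Sean describes can be sketched as follows. This is a minimal illustration in Python rather than R, not Sean's own code. The function name `gi_to_gene_id` is hypothetical, the sample rows are made-up toy values, and the column positions (tax_id first, GeneID second, protein GI seventh) are an assumption that should be checked against the README before use on the real file.

```python
import io


def gi_to_gene_id(lines, wanted_gis, tax_id="9606",
                  tax_col=0, gene_col=1, gi_col=6):
    """Return {protein GI: Entrez Gene ID} for the requested GIs.

    `lines` is any iterable of tab-delimited text lines. The default
    column indices (0-based) are an assumption about the file layout.
    """
    wanted = set(wanted_gis)
    needed = max(tax_col, gene_col, gi_col)
    mapping = {}
    for line in lines:
        if line.startswith("#"):          # skip the header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= needed or fields[tax_col] != tax_id:
            continue                      # short row or other organism
        gi = fields[gi_col]
        if gi in wanted:
            mapping.setdefault(gi, fields[gene_col])  # keep first hit
    return mapping


# Toy rows with made-up identifiers, NOT real NCBI records.
SAMPLE = (
    "#tax_id\tGeneID\tstatus\tRNA_acc\tRNA_gi\tprotein_acc\tprotein_gi\n"
    "9606\t1001\tREVIEWED\tNM_000001.1\t111\tNP_000001.1\t222\n"
    "9606\t1002\tREVIEWED\tNM_000002.1\t333\tNP_000002.1\t444\n"
    "10090\t2001\tREVIEWED\tNM_000003.1\t555\tNP_000003.1\t666\n"
)

query = ["222", "444", "666", "999"]
result = gi_to_gene_id(io.StringIO(SAMPLE), query)
unmatched = [gi for gi in query if gi not in result]

print(result)     # {'222': '1001', '444': '1002'}
print(unmatched)  # ['666', '999']
```

For the downloaded file, the same function should accept the handle returned by `gzip.open("gene2refseq.gz", "rt")`. Reporting the unmatched GIs makes it easy to see how many proteins are lost at this step.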
========================
Message: 4
Date: Mon, 11 Jun 2007 12:36:31 +0000 (UTC)
From: Nianhua Li <nialicn at yahoo.com>
Subject: Re: [BioC] getting Locus Link ids from gene symbol
To: bioconductor at stat.math.ethz.ch
Message-ID: <loom.20070611T142932-100 at post.gmane.org>
Content-Type: text/plain; charset=us-ascii
Hi, Alex,
You can parse ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
There are 4 useful columns: tax_id (column 1), GeneID (column 2), Symbol
(column 3), and Synonyms (column 5). You can:
1 Read in the file
2 Filter it based on tax_id
3 Match your gene symbols to the "Symbol" column and find their Gene IDs
4 Remove the matched gene symbols from your list
5 Match the remaining gene symbols to the "Synonyms" column and find their
Gene IDs
hope this helps
nianhua
Nianhua Li
Software Developer
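The five steps above can be sketched as follows. This is a minimal illustration in Python rather than R, not Nianhua's own code. The function name `symbols_to_gene_ids` is hypothetical and the sample rows are made-up toy values. It also assumes that the Synonyms column separates aliases with "|" and uses "-" for an empty entry, which should be confirmed against the README.

```python
import io


def symbols_to_gene_ids(lines, symbols, tax_id="9606"):
    """Map gene symbols to Gene IDs, trying Symbol first, then Synonyms.

    Returns {query symbol: [Gene ID, ...]}. A synonym can belong to
    several genes, so every value is a list.
    """
    by_symbol = {}
    by_synonym = {}
    for line in lines:                          # step 1: read the file
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5 or fields[0] != tax_id:
            continue                            # step 2: filter on tax_id
        gene_id, symbol, synonyms = fields[1], fields[2], fields[4]
        by_symbol.setdefault(symbol, gene_id)
        if synonyms != "-":                     # "-" assumed to mean empty
            for alias in synonyms.split("|"):
                by_synonym.setdefault(alias, []).append(gene_id)

    found = {}
    for query in symbols:
        if query in by_symbol:                  # step 3: official symbol
            found[query] = [by_symbol[query]]
        elif query in by_synonym:               # steps 4-5: synonyms only
            found[query] = by_synonym[query]    # for still-unmatched queries
    return found


# Toy rows with made-up identifiers, NOT real NCBI records.
SAMPLE = (
    "#tax_id\tGeneID\tSymbol\tLocusTag\tSynonyms\n"
    "9606\t1001\tGENEA\t-\tALIASA1|ALIASA2\n"
    "9606\t1002\tGENEB\t-\t-\n"
    "9606\t1003\tGENEC\t-\tGENEA|SHARED\n"
    "9606\t1004\tGENED\t-\tSHARED\n"
    "10090\t2001\tGENEM\t-\tALIASM\n"
)

found = symbols_to_gene_ids(
    io.StringIO(SAMPLE), ["GENEA", "ALIASA2", "SHARED", "ALIASM", "NOPE"]
)
print(found)
# {'GENEA': ['1001'], 'ALIASA2': ['1001'], 'SHARED': ['1003', '1004']}
```

Matching is exact and case-sensitive here. Checking the official symbol first means a query such as GENEA resolves to its own gene even when another gene lists it as a synonym, while an ambiguous synonym such as SHARED returns every candidate for manual review.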