[BioC] from RefSeq to GO terms / gene symbol to geneID

Fri Jun 29 07:54:00 CEST 2007

Dear Simon and Sean,

Sorry to get back to this issue so late, but I have been trying out various 
options to solve it. I parsed the files you mentioned but did not 
get many hits, since many of my proteins do not have an Entrez Gene ID 
for some reason. In my search I also tried some of the Entrez E-utilities 
(http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html) 
and could get the accession numbers for my proteins. Can I go from 
accession number to GO term using biomaRt, for example?

Thanks again!

Best,
Lina Rosenberg

Simon Lin wrote:
> In the following two unrelated messages, both Sean and Nianhua suggested 
> downloading and parsing some data tables from the NCBI. The gene_info and 
> several other tables seem very useful. If that is the case, why not 
> have them pre-loaded into an SQLite database and distributed as part of the 
> annotation package for human?
> 
> Simon
> 
> =================
> Date: Tue, 12 Jun 2007 05:59:55 -0400
> From: Sean Davis <sdavis2 at mail.nih.gov>
> Subject: Re: [BioC] from RefSeq GI protein identifiers to GO terms
> To: Lina Hultin-Rosenberg <lina.hultin-rosenberg at ki.se>
> Cc: bioconductor at stat.math.ethz.ch
> Message-ID: <466E6E9B.3020609 at mail.nih.gov>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> Lina Hultin-Rosenberg wrote:
> 
>>> Dear list,
>>>
>>> This might be a question that has been discussed previously, but I could not
>>> find any good solution for it. I have lists of human proteins from various
>>> proteomics studies that I want to compare with regard to the GO terms
>>> associated with them. I have the RefSeq GI protein ID for each protein, and my
>>> question is how best to map those to other identifiers that I can use in
>>> subsequent GO analysis.
>>>
>>> It might be that this problem is best solved outside R, but maybe someone
>>> can still give me a hint about the best solution. For me this is a problem that
>>> comes up quite often - the need to map between different identifiers - and I
>>> have not yet found any really good solution to it. If I use IPI, for example, I
>>> always lose some proteins/genes since the coverage is rather poor, but maybe
>>> there is no solution that will give a perfect mapping?!
>>  
>>
> 
> The file located here:
> 
> ftp://ftp.ncbi.nih.gov/gene/DATA/gene2refseq.gz
> 
> and described in detail here:
> 
> ftp://ftp.ncbi.nih.gov/gene/DATA/README
> 
> maps RefSeq to Entrez Gene ID.  Once you have the Entrez Gene ID, you
> can use the Bioconductor annotation packages to get GO mappings.  The
> file above is a tab-delimited text file, so you should be able to read
> it into R and do the matching by GI number rather easily.
> 
> Hope that helps.
> 
> Sean
> 
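The matching Sean describes could be sketched roughly as follows, here in Python rather than R. This is only an illustration under stated assumptions: the column positions (tax_id in column 1, GeneID in column 2, protein GI in column 7) should be checked against the README, and the function names and the default human taxonomy ID 9606 are choices made for this sketch, not part of any package.

```python
import gzip

HUMAN_TAX_ID = "9606"  # assumption: NCBI taxonomy ID for human


def map_gi_to_gene_id(lines, wanted_gis, tax_id=HUMAN_TAX_ID,
                      tax_col=0, gene_col=1, gi_col=6):
    """Map protein GI numbers to Entrez Gene IDs.

    `lines` is any iterable of tab-delimited text lines in the
    gene2refseq layout. The 0-based column indices are assumptions;
    verify them against the README before relying on the result.
    Returns a dict: GI (str) -> set of Gene IDs (str).
    """
    wanted = {str(gi) for gi in wanted_gis}
    needed = max(tax_col, gene_col, gi_col)
    mapping = {}
    for line in lines:
        if line.startswith("#"):          # skip the header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= needed:         # skip malformed rows
            continue
        if fields[tax_col] != tax_id:     # keep one organism only
            continue
        gi = fields[gi_col]
        if gi in wanted:
            # A GI may appear on several rows, so collect a set.
            mapping.setdefault(gi, set()).add(fields[gene_col])
    return mapping


def map_gi_to_gene_id_from_file(path, wanted_gis, **kwargs):
    """Same as above, reading a local gzip-compressed copy of the file."""
    with gzip.open(path, "rt") as handle:
        return map_gi_to_gene_id(handle, wanted_gis, **kwargs)
```

GIs that are absent from the returned dictionary had no row for the chosen organism, which is one way to count how many proteins lack an Entrez Gene ID.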
> ========================
> Message: 4
> Date: Mon, 11 Jun 2007 12:36:31 +0000 (UTC)
> From: Nianhua Li <nialicn at yahoo.com>
> Subject: Re: [BioC] getting Locus Link ids from gene symbol
> To: bioconductor at stat.math.ethz.ch
> Message-ID: <loom.20070611T142932-100 at post.gmane.org>
> Content-Type: text/plain; charset=us-ascii
> 
> Hi, Alex,
> 
> You can parse ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
> There are 4 useful columns: tax_id (column 1), GeneID (column 2), Symbol 
> (column 3), and Synonyms (column 5). You can:
> 
> 1 Read in the file
> 2 Filter it based on tax_id
> 3 Match your gene symbols to the "Symbol" column and find their Gene ID
> 4 Remove the matched gene symbols from your list
> 5 Match the rest of the gene symbols to the "Synonyms" column and find their
>   Gene ID
> 
> hope this helps
> 
> nianhua
> 
> Nianhua Li
> Software Developer
> 
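The five steps Nianhua lists could be sketched as below, again in Python as an illustration only. The column positions follow the message (tax_id, GeneID, Symbol, Synonyms in columns 1, 2, 3 and 5). The sketch also assumes that synonyms are separated by "|" and that "-" marks an empty field. The function name is hypothetical, and matching is case-sensitive.

```python
def symbols_to_gene_ids(lines, symbols, tax_id="9606"):
    """Map gene symbols to Gene IDs, falling back to synonyms.

    `lines` is any iterable of tab-delimited gene_info lines.
    Returns (matches, unmatched): `matches` maps each symbol to a
    set of Gene IDs, `unmatched` lists symbols found in neither
    the Symbol nor the Synonyms column.
    """
    by_symbol = {}
    by_synonym = {}
    # Steps 1 and 2: read the file and keep rows for one tax_id.
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5 or fields[0] != tax_id:
            continue
        gene_id, symbol, synonyms = fields[1], fields[2], fields[4]
        by_symbol.setdefault(symbol, set()).add(gene_id)
        if synonyms != "-":               # assumption: "-" means empty
            for synonym in synonyms.split("|"):
                by_synonym.setdefault(synonym, set()).add(gene_id)

    matches = {}
    remaining = []
    # Steps 3 and 4: match on Symbol, set aside what did not match.
    for symbol in symbols:
        if symbol in by_symbol:
            matches[symbol] = by_symbol[symbol]
        else:
            remaining.append(symbol)

    unmatched = []
    # Step 5: match the rest on Synonyms.
    for symbol in remaining:
        if symbol in by_synonym:
            matches[symbol] = by_synonym[symbol]
        else:
            unmatched.append(symbol)
    return matches, unmatched
```

Because a synonym can belong to more than one gene, each match is returned as a set of Gene IDs instead of a single value, leaving ambiguous cases visible to the caller.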
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
>