[BioC] from RefSeq to GO terms / gene symbol to geneID
Lina Hultin-Rosenberg
lina.hultin-rosenberg at ki.se
Fri Jun 29 07:54:00 CEST 2007
Dear Simon and Sean,
Sorry to get back to this issue so late, but I have tried out various
options to solve it. I parsed the files you mentioned but did not
get many hits, since many of my proteins do not have an Entrez Gene ID
for some reason. In my search I also tried some of the Entrez E-utilities
(http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html)
and could get the accession numbers for my proteins. Can I go from
accession number to GO term using, for example, biomaRt?
Thanks again!
Best,
Lina Rosenberg
Simon Lin skrev:
> In the following two unrelated messages, both Sean and Nianhua suggested
> downloading and parsing some data tables from the NCBI. The gene_info and
> several other tables seem very useful. If that is the case, why not
> have them pre-loaded into an SQLite database and distributed as part of the
> annotation package for human?
>
> Simon
>
> =================
> Date: Tue, 12 Jun 2007 05:59:55 -0400
> From: Sean Davis <sdavis2 at mail.nih.gov>
> Subject: Re: [BioC] from RefSeq GI protein identifiers to GO terms
> To: Lina Hultin-Rosenberg <lina.hultin-rosenberg at ki.se>
> Cc: bioconductor at stat.math.ethz.ch
> Message-ID: <466E6E9B.3020609 at mail.nih.gov>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Lina Hultin-Rosenberg wrote:
>
>>> Dear list,
>>>
>>> This might be a question that has been discussed previously, but I could not
>>> find any good solution for it. I have lists of human proteins from various
>>> proteomics studies that I want to compare with regard to the GO terms
>>> associated with them. I have the RefSeq GI protein ID for the proteins, and my
>>> question is how best to map those to other identifiers that I can use in
>>> subsequent GO analysis.
>>>
>>> It might be that this problem is best solved outside R, but maybe someone
>>> can still give me a hint to the best solution. For me this is a problem that
>>> comes up quite often - the need to map between different identifiers - and I
>>> have not yet found any really good solution to it. If I use IPI, for example, I
>>> always lose some proteins/genes since the coverage is rather poor, but maybe
>>> there is no solution that will give a perfect mapping?!
>>
>>
>
> The file located here:
>
> ftp://ftp.ncbi.nih.gov/gene/DATA/gene2refseq.gz
>
> and described in detail here:
>
> ftp://ftp.ncbi.nih.gov/gene/DATA/README
>
> maps RefSeq to Entrez Gene ID. Once you have the Entrez Gene ID, you
> can use the Bioconductor annotation packages to get GO mappings. The
> file above is a tab-delimited text file, so you should be able to read
> it into R and do the matching by GI number rather easily.
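The GI matching described above could be sketched as follows. This is only an illustration in Python, not code from the thread: the function name `build_gi_to_geneid`, the column positions (GeneID in column 2, protein GI in column 7), the use of "-" for missing values, and the toy rows are all assumptions to check against the README; the sample identifiers are made up.

```python
import csv
import io


def build_gi_to_geneid(lines, tax_id="9606", gene_col=1, gi_col=6):
    """Build a dict mapping protein GI -> Entrez Gene ID.

    `lines` is any iterable of tab-delimited text lines. The default
    column indices (0-based) are assumptions; verify them in the README.
    """
    mapping = {}
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines and the "#"-prefixed header line.
        if not row or row[0].startswith("#"):
            continue
        # Keep only the requested organism (9606 = human).
        if row[0] != tax_id:
            continue
        gi = row[gi_col]
        # "-" is assumed to mark a row without a protein GI.
        if gi == "-":
            continue
        # Keep the first Gene ID seen for each GI.
        mapping.setdefault(gi, row[gene_col])
    return mapping


# Toy rows with invented identifiers, trimmed to the first seven columns.
SAMPLE = (
    "#tax_id\tGeneID\tstatus\tRNA_acc\tRNA_gi\tprotein_acc\tprotein_gi\n"
    "9606\t1001\tREVIEWED\tNM_X1\t111\tNP_X1\t5001\n"
    "9606\t1002\tREVIEWED\tNR_X2\t112\t-\t-\n"
    "10090\t2001\tVALIDATED\tNM_X3\t113\tNP_X3\t6001\n"
)

gi_to_gene = build_gi_to_geneid(io.StringIO(SAMPLE))

# Look up a list of GIs; unmapped ones come back as None.
my_gis = ["5001", "6001", "9999"]
mapped = {gi: gi_to_gene.get(gi) for gi in my_gis}
print(mapped)
```

For the real file, the same function should work on a handle opened with `gzip.open(path, "rt")` from the standard library, since it only needs an iterable of text lines.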
>
> Hope that helps.
>
> Sean
>
> ========================
> Message: 4
> Date: Mon, 11 Jun 2007 12:36:31 +0000 (UTC)
> From: Nianhua Li <nialicn at yahoo.com>
> Subject: Re: [BioC] getting Locus Link ids from gene symbol
> To: bioconductor at stat.math.ethz.ch
> Message-ID: <loom.20070611T142932-100 at post.gmane.org>
> Content-Type: text/plain; charset=us-ascii
>
> Hi, Alex,
>
> You can parse ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
> There are 4 useful columns: tax_id (column 1), GeneID (column 2), Symbol
> (column 3), and Synonyms (column 5). You can:
>
> 1 Read in the file
> 2 Filter it based on tax_id
> 3 Match your gene symbols to the "Symbol" column and find their Gene ID
> 4 Remove the matched gene symbols from your list
> 5 Match the rest of the gene symbols to the "Synonyms" column and find their Gene
> ID
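The five steps above could be sketched roughly as follows. This is an illustration in Python, not code from the thread: the function name `symbols_to_gene_ids` is hypothetical, the "|" separator and "-" placeholder in the Synonyms column are assumptions to verify against the file, and the toy rows use invented symbols and IDs.

```python
import csv
import io


def symbols_to_gene_ids(lines, symbols, tax_id="9606"):
    """Map gene symbols to Gene IDs, trying Symbol first, then Synonyms.

    Returns a dict: symbol -> list of Gene IDs (empty if nothing matched).
    """
    by_symbol = {}
    by_synonym = {}
    # Steps 1-2: read the tab-delimited rows and filter on tax_id.
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row or row[0].startswith("#") or row[0] != tax_id:
            continue
        gene_id, symbol, synonyms = row[1], row[2], row[4]
        by_symbol.setdefault(symbol, gene_id)
        # Assumption: synonyms are "|"-separated and "-" means none.
        if synonyms != "-":
            for syn in synonyms.split("|"):
                by_synonym.setdefault(syn, []).append(gene_id)

    result = {}
    for s in symbols:
        if s in by_symbol:
            # Step 3: exact match on the Symbol column.
            result[s] = [by_symbol[s]]
        else:
            # Steps 4-5: only unmatched symbols fall back to Synonyms,
            # which may be ambiguous (several genes share a synonym).
            result[s] = by_synonym.get(s, [])
    return result


# Toy rows: tax_id, GeneID, Symbol, LocusTag, Synonyms (invented values).
SAMPLE = (
    "#tax_id\tGeneID\tSymbol\tLocusTag\tSynonyms\n"
    "9606\t1\tAAA1\t-\tOLDA|A1X\n"
    "9606\t2\tBBB2\t-\t-\n"
    "9606\t3\tCCC3\t-\tOLDA\n"
    "10090\t4\tAAA1\t-\t-\n"
)

hits = symbols_to_gene_ids(io.StringIO(SAMPLE), ["AAA1", "OLDA", "A1X", "ZZZ"])
print(hits)
```

Returning a list per symbol is a deliberate choice here: a synonym can point to more than one gene, so ambiguous matches stay visible instead of being silently resolved.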
>
> hope this helps
>
> nianhua
>
> Nianhua Li
> Software Developer
>