[BioC] R: is there an identifier that uniquely identifies a gene all over the many databases ?
Simon Anders
anders at ebi.ac.uk
Mon Jul 13 12:59:15 CEST 2009
Hi
mauede at alice.it wrote:
> I forgot to specify that I am only dealing with Human species.
> I used the ENSGxxxxx identifier to get out some data that I hoped would
> uniquely identify the gene.
>
> > gene.map <- getBM(attributes=c("hgnc_symbol","external_gene_id","refseq_dna"),
> +                   filters="ensembl_gene_id",values="ENSG00000206557",mart=hmart)
> > show(gene.map)
>
> As long as all Human genes are uniquely identified through their
> respective "hgnc_symbol" I am fine.
>
> Why should I use the other identifier you mention ENSTxxxx ?
Well, I mentioned them because you talked about genes and transcripts as
if these two were interchangeable.
If you use Ensembl's Biomart, you will usually get one data record for each
transcript, not one for each gene. Take, for example, the gene GLB1
(ENSG00000170266).
It has three transcripts:
http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000170266;r=3:32621636-33121635;t=ENST00000307363
The first transcript (ENST00000307377) has a different 3' UTR from the second
and third (ENST00000307363 and ENST00000399402).
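To make the "one record per transcript" point concrete, here is a rough sketch in base R. The data frame is built by hand from the IDs mentioned above; it is only a stand-in for what a Biomart result might look like, not real query output, and the column names are assumed to match the usual Ensembl attribute names.

```r
# Hand-built stand-in for a Biomart result (NOT real query output),
# using only the gene and transcript IDs quoted in this message.
glb1 <- data.frame(
  ensembl_gene_id       = rep("ENSG00000170266", 3),
  ensembl_transcript_id = c("ENST00000307377",
                            "ENST00000307363",
                            "ENST00000399402"),
  hgnc_symbol           = rep("GLB1", 3),
  stringsAsFactors      = FALSE
)

# One row per transcript ...
nrow(glb1)                             # 3

# ... but all rows belong to the same gene.
length(unique(glb1$ensembl_gene_id))   # 1

# If only gene-level information is needed, drop the transcript column
# and collapse the duplicated rows.
gene.level <- unique(glb1[, c("ensembl_gene_id", "hgnc_symbol")])
```

So a table with several rows for one ENSG identifier is not an error; each row simply describes a different transcript of the same gene.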
As Steven wrote, you should add "ensembl_transcript_id" to your list of
attributes to see what is going on.
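A minimal sketch of such a query follows. It assumes the biomaRt package is installed and the Ensembl server is reachable, and that the mart is the human dataset "hsapiens_gene_ensembl". The helper transcripts_for_gene() is a hypothetical name of mine, not part of biomaRt, and the attribute names available may differ between Ensembl releases.

```r
# Attributes to request. Including the transcript ID shows which
# transcript each returned row belongs to.
attrs <- c("ensembl_gene_id", "ensembl_transcript_id", "hgnc_symbol")

# Hypothetical helper (not part of biomaRt): fetch the records for one
# gene, filtering on its Ensembl gene ID.
transcripts_for_gene <- function(gene_id, mart) {
  biomaRt::getBM(attributes = attrs,
                 filters    = "ensembl_gene_id",
                 values     = gene_id,
                 mart       = mart)
}

# The query needs biomaRt and network access, so it is only run in an
# interactive session.
if (interactive()) {
  # Assumed setup: the human Ensembl gene dataset.
  hmart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
  gene.map <- transcripts_for_gene("ENSG00000170266", hmart)
  show(gene.map)
}
```

The number of rows returned depends on the Ensembl release you are connected to, so do not expect it to match the transcript count given above exactly.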
Personally, I also find it very helpful to first try out any Biomart
query on the web interface
http://www.ensembl.org/biomart/martview
before going to R. There, you can see quite easily what is going on.
Cheers
Simon