[BioC] unable to find known entrezgene with biomaRt

Sat Jan 19 21:30:33 CET 2008

Hi Jim,

As always, you offer great help.  Thanks for taking the time to give me examples as well. That is much appreciated.

I wasn't aware of org.Hs.eg.db.  That seems like exactly what I should be using for what I need.

I guess I should start paying closer attention to BioC announcements :-)

Cheers,
Dick

*******************************************************************************
Richard P. Beyer, Ph.D.	University of Washington
Tel.:(206) 616 7378	Env. & Occ. Health Sci. , Box 354695
Fax: (206) 685 4696	4225 Roosevelt Way NE, # 100
 			Seattle, WA 98105-6099
http://depts.washington.edu/ceeh/ServiceCores/FC5/FC5.html
http://staff.washington.edu/~dbeyer
*******************************************************************************

On Sat, 19 Jan 2008, James MacDonald wrote:

> Hi Dick,
>
> What information are you starting with? Do you just need the gene symbol and 
> description?
>
> If you have the Entrez Gene ID it is really simple.
>
>> library(org.Hs.eg.db)
>> get("3514", org.Hs.egSYMBOL)
> [1] "IGKC"
>> get("3514", org.Hs.egGENENAME)
> [1] "immunoglobulin kappa constant"
>
> If you have multiple IDs, then of course you need to use mget() and then 
> wrangle the resulting lists into whatever shape you need. An alternative with 
> the sweet new SQLite db format (thanks to the friendly folks in Seattle) is to 
> dump everything out and then subset from there.
>
>> ids <- ls(org.Hs.egSYMBOL)[1:10] ##some random IDs
>> thesymbs <- toTable(org.Hs.egSYMBOL) ##dump
>> thesymbs[thesymbs[,1] %in% ids,]
>   gene_id   symbol
> 1        1     A1BG
> 2        2      A2M
> 3        9     NAT1
> 4       10     NAT2
> 5       12 SERPINA3
> 6       13    AADAC
> 7       14     AAMP
> 8       15    AANAT
> 9       16     AARS
> 10      18     ABAT
>
> If you have the Ensembl ID I would use biomaRt.
>
>> getBM(c("hgnc_symbol", "description"), "ensembl_gene_id", "ENSG00000211592", mart=mart, output="list")
> $hgnc_symbol
> $hgnc_symbol$ENSG00000211592
> [1] NA
>
>
> $description
> $description$ENSG00000211592
> [1] "Immunoglobulin Kappa light chain C gene segment [Source:IMGT/GENE_DB;Acc:IGKC]"
>
> As noted before, the information from the two sources doesn't always agree 
> 100%, which is sorta weird in this case since the description field from 
> Ensembl _does_ contain the gene symbol.
>
> Anyway I hope that helps.
>
>
> Best,
>
> Jim
>
>
>
> Dick Beyer wrote:
>> Hi Jim,
>> 
>> Thanks for explaining this to me.  I had assumed that if the gene was in 
>> Ensembl, then I could also get related information such as the Entrez 
>> Gene ID.
>> 
>> Is there some bioconductor way, similar to biomaRt, to access this Entrez 
>> Gene ID?  What I am really using the getBM call for is just to get a gene 
>> symbol and a gene description given the Entrez Gene ID.
>> 
>> Thanks very much,
>> Dick 
>> 
>> On Sat, 19 Jan 2008, James W. MacDonald wrote:
>> 
>>> Hi Dick,
>>> 
>>> I'm not sure I understand your question. When I go to the webpage you 
>>> reference, there is AFAICT no mention of this gene being the same as 
>>> Entrez Gene 3514 (other than having the same symbol). Nor does Entrez Gene 
>>> mention that it is the same as Ensembl Gene ENSG00000211592.
>>> 
>>> A quick look at the location of the gene would imply that it probably is 
>>> the same, and not two genes that have the same symbol (which is not 
>>> unique).
>>> 
>>> Since both the web interface and the programmatic interface agree, this 
>>> isn't a matter of inconsistencies between the interfaces, so perhaps the 
>>> question is why do Entrez Gene and Ensembl not reference each other?
>>> 
>>> If so, I think this is simply because two different groups are doing the 
>>> annotation, and they are not always perfect at referencing each other.
>>> 
>>> Best,
>>> 
>>> Jim
>>> 
>>> 
>>> 
>>> Dick Beyer wrote:
>>>> Hello,
>>>> 
>>>> I am unable to find some Entrez Gene IDs in the Ensembl Homo sapiens 
>>>> database via biomaRt, even though I can find them through the Ensembl 
>>>> web interface.
>>>> 
>>>> library(biomaRt)
>>>> mart <- useMart( "ensembl", dataset="hsapiens_gene_ensembl")
>>>> 
>>>> getBM(attributes=c("entrezgene","hgnc_symbol","ensembl_gene_id"),filters="entrezgene",values=3845, 
>>>> mart=mart)
>>>>    entrezgene hgnc_symbol ensembl_gene_id
>>>> 1       3845        KRAS ENSG00000133703
>>>> 
>>>> getBM(attributes=c("entrezgene","hgnc_symbol","ensembl_gene_id"),filters="entrezgene",values=3514, 
>>>> mart=mart)
>>>> NULL
>>>> 
>>>> The ensembl web interface:
>>>> 
>>>> http://www.ensembl.org/Homo_sapiens/geneview?gene=ENSG00000211592
>>>> 
>>>> shows Entrez Gene ID 3514 corresponds to ensembl_gene_id 
>>>> ENSG00000211592, IGKC.
>>>> 
>>>> I'm curious why my biomaRt session will return good results for some 
>>>> valid Entrez Gene IDs but not for others.  I'm not sure what to try 
>>>> next.  I'd very much appreciate any help.
>>>> 
>>>> sessionInfo()
>>>> R version 2.6.1 (2007-11-26)
>>>> x86_64-redhat-linux-gnu
>>>> 
>>>> locale:
>>>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>>>> 
>>>> attached base packages:
>>>> [1] tools     stats     graphics  grDevices utils     datasets  methods
>>>> [8] base
>>>> 
>>>> other attached packages:
>>>>   [1] topGO_1.4.0         SparseM_0.75        AnnotationDbi_1.0.6
>>>>   [4] RSQLite_0.6-4       DBI_0.2-4           GO_2.0.1
>>>>   [7] Biobase_1.16.2      graph_1.16.1        biomaRt_1.12.2
>>>> [10] RCurl_0.8-3
>>>> 
>>>> loaded via a namespace (and not attached):
>>>> [1] cluster_1.11.9  rcompgen_0.1-17 XML_1.93-2
>>>> 
>>>> Thanks much,
>>>> Dick
>>>> 
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: 
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
>
> -- 
> James W. MacDonald, MS
> Biostatistician
> UMCCC cDNA and Affymetrix Core
> University of Michigan
> 1500 E Medical Center Drive
> 7410 CCGC
> Ann Arbor MI 48109
> 734-647-5623
>
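
P.S. For anyone finding this thread in the archives: below is a minimal sketch of the mget() route Jim describes above, collapsing the per-ID lists into one table. The two small lists are hand-typed stand-ins for what mget() on org.Hs.egSYMBOL and org.Hs.egGENENAME would return, so the snippet runs without the annotation package. The helper name first_value() is made up for illustration, and keeping only the first value per ID is an assumption that would drop extra entries if an ID ever maps to more than one.

```r
# IDs to annotate; "0" is a deliberately invalid ID to show the NA case.
ids <- c("3514", "1", "0")

# Hand-typed stand-ins (assumption) for the named lists mget() returns.
# With org.Hs.eg.db loaded, the real calls would be along the lines of:
#   syms  <- mget(ids, org.Hs.egSYMBOL,   ifnotfound = NA)
#   descs <- mget(ids, org.Hs.egGENENAME, ifnotfound = NA)
syms  <- list("3514" = "IGKC", "1" = "A1BG", "0" = NA)
descs <- list("3514" = "immunoglobulin kappa constant",
              "1"    = "alpha-1-B glycoprotein",
              "0"    = NA)

# Hypothetical helper: keep the first value per ID as a plain character vector.
first_value <- function(x) {
  unname(vapply(x, function(v) as.character(v[1]), character(1)))
}

# Index the lists by ids so the rows follow the order of the request.
annot <- data.frame(
  gene_id     = ids,
  symbol      = first_value(syms[ids]),
  description = first_value(descs[ids]),
  stringsAsFactors = FALSE
)

print(annot)
```

IDs that are not found come back as NA rows rather than being silently dropped, which makes it easy to see which ones need a second look.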