[BioC] unable to find known entrezgene with biomaRt
Dick Beyer
dbeyer at u.washington.edu
Sat Jan 19 21:30:33 CET 2008
Hi Jim,
As always, you offer great help. Thanks for taking the time to give me examples as well. That is much appreciated.
I wasn't aware of org.Hs.eg.db. That seems like exactly what I should be using for what I need.
I guess I should start paying closer attention to BioC announcements :-)
Cheers,
Dick
*******************************************************************************
Richard P. Beyer, Ph.D. University of Washington
Tel.:(206) 616 7378 Env. & Occ. Health Sci. , Box 354695
Fax: (206) 685 4696 4225 Roosevelt Way NE, # 100
Seattle, WA 98105-6099
http://depts.washington.edu/ceeh/ServiceCores/FC5/FC5.html
http://staff.washington.edu/~dbeyer
*******************************************************************************
On Sat, 19 Jan 2008, James MacDonald wrote:
> Hi Dick,
>
> What information are you starting with? Do you just need the gene symbol and
> description?
>
> If you have the Entrez Gene ID it is really simple.
>
>> library(org.Hs.eg.db)
>> get("3514", org.Hs.egSYMBOL)
> [1] "IGKC"
>> get("3514", org.Hs.egGENENAME)
> [1] "immunoglobulin kappa constant"
>
> If you have multiple IDs, then of course you need to use mget() and then
> wrangle the resulting lists into whatever shape you need. An alternative with
> the sweet new SQLite db format (thanks to the friendly folks in Seattle) is to
> dump everything out and then subset from there.
>
>> ids <- ls(org.Hs.egSYMBOL)[1:10] ##some random IDs
>> thesymbs <- toTable(org.Hs.egSYMBOL) ##dump
>> thesymbs[thesymbs[,1] %in% ids,]
>    gene_id   symbol
> 1        1     A1BG
> 2        2      A2M
> 3        9     NAT1
> 4       10     NAT2
> 5       12 SERPINA3
> 6       13    AADAC
> 7       14     AAMP
> 8       15    AANAT
> 9       16     AARS
> 10      18     ABAT
>
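For the mget() route mentioned above, one way to flatten the resulting list is sketched here. The list is a hand-written stand-in for what mget(ids, org.Hs.egSYMBOL, ifnotfound = NA) would return, using symbols from the table above plus one made-up ID with no match. The helper name list_to_table is hypothetical, not part of any package.

```r
# Hand-written stand-in for an mget() result: a named list keyed by
# Entrez Gene ID. "99999999" is a made-up ID used to show a missing entry.
symbs <- list("1" = "A1BG", "2" = "A2M", "9" = "NAT1", "99999999" = NA)

# Hypothetical helper: flatten a named list into a two-column data.frame.
# rep() repeats each ID once per value, so one-to-many maps also work.
list_to_table <- function(x, value_name = "symbol") {
  out <- data.frame(
    gene_id = rep(names(x), lengths(x)),
    value = unlist(x, use.names = FALSE),
    stringsAsFactors = FALSE
  )
  names(out)[2] <- value_name
  out
}

symbol_table <- list_to_table(symbs)
print(symbol_table)
```

IDs without a match stay in the table with NA in the symbol column, so they are easy to spot afterwards.
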
> If you have the Ensembl ID I would use biomaRt.
>
>> getBM(c("hgnc_symbol", "description"), "ensembl_gene_id",
> "ENSG00000211592",mart=mart, output="list")
> $hgnc_symbol
> $hgnc_symbol$ENSG00000211592
> [1] NA
>
>
> $description
> $description$ENSG00000211592
> [1] "Immunoglobulin Kappa light chain C gene segment
> [Source:IMGT/GENE_DB;Acc:IGKC]"
>
> As noted before, the information from the two sources doesn't always agree
> 100%, which is sorta weird in this case since the description field from
> Ensembl _does_ contain the gene symbol.
>
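Since the description above carries the symbol in its "Acc:" tag, a rough workaround is to pull it out with a regular expression. This is only a sketch: the helper name extract_acc is hypothetical, and for other sources the "Acc:" value may be an identifier rather than a gene symbol, so the result should be checked before it is trusted.

```r
# Description string copied from the getBM() output above (joined onto one line).
desc <- "Immunoglobulin Kappa light chain C gene segment [Source:IMGT/GENE_DB;Acc:IGKC]"

# Hypothetical helper: return the text after "Acc:" inside the trailing
# "[Source:...;Acc:...]" tag, or NA when no such tag is present.
extract_acc <- function(x) {
  has_tag <- grepl("Acc:[^]]+\\]", x)
  acc <- sub(".*Acc:([^]]+)\\].*", "\\1", x)
  ifelse(has_tag, acc, NA_character_)
}

print(extract_acc(desc))
print(extract_acc("no source tag here"))
```
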
> Anyway I hope that helps.
>
>
> Best,
>
> Jim
>
>
>
> Dick Beyer wrote:
>> Hi Jim,
>>
>> Thanks for explaining this to me. I had assumed that if the gene was in
>> ensembl, then I could get other bits of info such as Entrez Gene ID and
>> such.
>>
>> Is there some bioconductor way, similar to biomaRt, to access this Entrez
>> Gene ID? What I am really using the getBM call for is just to get a gene
>> symbol and a gene description given the Entrez Gene ID.
>>
>> Thanks very much,
>> Dick
>>
>> On Sat, 19 Jan 2008, James W. MacDonald wrote:
>>
>>> Hi Dick,
>>>
>>> I'm not sure I understand your question. When I go to the webpage you
>>> reference, there is AFAICT no mention of this gene being the same as
>>> Entrez Gene 3514 (other than having the same symbol). Nor does Entrez Gene
>>> mention that it is the same as Ensembl Gene ENSG00000211592.
>>>
>>> A quick look at the location of the gene would imply that it probably is
>>> the same, and not two genes that have the same symbol (which is not
>>> unique).
>>>
>>> Since both the web interface and the programmatic interface agree, this
>>> isn't a matter of inconsistencies between the interfaces, so perhaps the
>>> question is why do Entrez Gene and Ensembl not reference each other?
>>>
>>> If so, this I think is simply due to the fact that you have two different
>>> groups that are doing the annotation, and they are not always perfect at
>>> referencing each other.
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>
>>> Dick Beyer wrote:
>>>> Hello,
>>>>
>>>> I am unable to find some Entrez Gene IDs in the Ensembl Homo sapiens
>>>> database via biomaRt, even though I can find them through the Ensembl
>>>> web interface.
>>>>
>>>> library(biomaRt)
>>>> mart <- useMart( "ensembl", dataset="hsapiens_gene_ensembl")
>>>>
>>>> getBM(attributes=c("entrezgene","hgnc_symbol","ensembl_gene_id"),
>>>>       filters="entrezgene", values=3845, mart=mart)
>>>>   entrezgene hgnc_symbol ensembl_gene_id
>>>> 1       3845        KRAS ENSG00000133703
>>>>
>>>> getBM(attributes=c("entrezgene","hgnc_symbol","ensembl_gene_id"),
>>>>       filters="entrezgene", values=3514, mart=mart)
>>>> NULL
>>>>
>>>> The ensembl web interface:
>>>>
>>>> http://www.ensembl.org/Homo_sapiens/geneview?gene=ENSG00000211592
>>>>
>>>> shows Entrez Gene ID 3514 corresponds to ensembl_gene_id
>>>> ENSG00000211592, IGKC.
>>>>
>>>> I'm curious why my biomaRt session will return good results for some
>>>> valid Entrez Gene IDs but not for others. I'm not sure what to try
>>>> next. I'd very much appreciate any help.
>>>>
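When many IDs are queried at once, it can help to list which ones came back empty. The sketch below assumes both IDs were sent in a single getBM() call with values = requested. The data.frame is a hand-written stand-in based on the two results shown above, not real output from such a call.

```r
# IDs from the two queries above.
requested <- c(3845, 3514)

# Hand-written stand-in for what a combined getBM() call might return,
# based on the results above: only 3845 has a row.
result <- data.frame(
  entrezgene = 3845,
  hgnc_symbol = "KRAS",
  ensembl_gene_id = "ENSG00000133703",
  stringsAsFactors = FALSE
)

# IDs that were asked for but are absent from the result.
missing_ids <- setdiff(requested, result$entrezgene)
print(missing_ids)
```

If the query returns NULL instead of a data.frame, result$entrezgene is also NULL and setdiff() simply reports every requested ID as missing.
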
>>>> sessionInfo()
>>>> R version 2.6.1 (2007-11-26)
>>>> x86_64-redhat-linux-gnu
>>>>
>>>> locale:
>>>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] tools stats graphics grDevices utils datasets methods
>>>> [8] base
>>>>
>>>> other attached packages:
>>>> [1] topGO_1.4.0 SparseM_0.75 AnnotationDbi_1.0.6
>>>> [4] RSQLite_0.6-4 DBI_0.2-4 GO_2.0.1
>>>> [7] Biobase_1.16.2 graph_1.16.1 biomaRt_1.12.2
>>>> [10] RCurl_0.8-3
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] cluster_1.11.9 rcompgen_0.1-17 XML_1.93-2
>>>>
>>>> Thanks much,
>>>> Dick
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> --
> James W. MacDonald, MS
> Biostatistician
> UMCCC cDNA and Affymetrix Core
> University of Michigan
> 1500 E Medical Center Drive
> 7410 CCGC
> Ann Arbor MI 48109
> 734-647-5623
>