[BioC] Search queries with biomaRt does not align with online queries via ensembl
James W. MacDonald
jmacdon at med.umich.edu
Mon Mar 1 16:55:27 CET 2010
Hi Tony,
ATF4 isn't a valid gene name, it's a HUGO gene symbol. The gene name can
be retrieved using the 'description' attribute. So you have to know that
ATF4 is a gene symbol, and that Ensembl calls these things hgnc_symbols.
But your question still remains: how does one decide which of the often
inscrutable filters/attributes to use to get a set of results?
This is compounded by the fact that Ensembl will sometimes change what
they call things. For instance, hgnc_symbol was once simply symbol. And
for a while there, one had to know that for humans you used symbol, but
for mice you used mgi_symbol...
There isn't a quick answer to this question. Steffen added a second
column to the output of both listFilters() and listAttributes() that may
help (although oftentimes it is the same as the first, minus the
underscores). What it often comes down to is trial and error, choosing
different attributes that might plausibly return what you want.
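A minimal sketch of that trial-and-error search: listFilters() and listAttributes() return data frames with `name` and `description` columns, so one can grep both columns for a keyword. The helper name search_mart_table is made up for illustration, and the toy table is hand-written (its descriptions are placeholders, not real Ensembl output) so that the snippet runs without a network connection.

```r
# Hypothetical helper: keep rows whose name OR description matches a pattern.
search_mart_table <- function(tbl, pattern) {
  hit <- grepl(pattern, tbl$name, ignore.case = TRUE) |
         grepl(pattern, tbl$description, ignore.case = TRUE)
  tbl[hit, , drop = FALSE]
}

# Toy stand-in for listAttributes(mart); the descriptions are placeholders.
toy_attributes <- data.frame(
  name = c("ensembl_gene_id", "description", "hgnc_symbol",
           "hgnc_curated_gene_name", "hgnc_automatic_gene_name"),
  description = c("Ensembl Gene ID", "Description", "HGNC symbol",
                  "HGNC curated gene name", "HGNC automatic gene name"),
  stringsAsFactors = FALSE
)

hgnc_hits <- search_mart_table(toy_attributes, "hgnc")
print(hgnc_hits$name)

# With a live connection the same call would be, e.g.:
#   search_mart_table(listAttributes(mart), "hgnc")
```

The same helper should work on the output of listFilters(), since it has the same two columns.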
One strategy I use is to try the shortest possible attribute name that
might describe what I want. It seems the more descriptors are added to a
given attribute, the less data there is on the back end. So for instance,
something like hgnc_automatic_gene_name would be quite low on a list of
attributes that I would explore. OTOH, "curated" might be more useful,
so hgnc_curated_gene_name to me is more likely to bear fruit.
> getBM(c("hgnc_symbol","description","hgnc_curated_gene_name"),
        "hgnc_symbol", "ATF4", mart)
hgnc_symbol
1 ATF4
description
1 Cyclic AMP-dependent transcription factor ATF-4 (cAMP-dependent
transcription factor ATF-4)(Activating transcription factor
4)(DNA-binding protein TAXREB67)(Cyclic AMP-responsive element-binding
protein 2)(cAMP-responsive element-binding protein 2)(CREB-2)
[Source:UniProtKB/Swiss-Prot;Acc:P18848]
hgnc_curated_gene_name
1 ATF4
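The shortest-name-first heuristic above can be sketched as a simple ordering of candidate attribute names by length. This is only a rule of thumb, not anything biomaRt provides; the helper name shortest_first is hypothetical.

```r
# Hypothetical helper: order candidate names from shortest to longest,
# breaking ties alphabetically.
shortest_first <- function(candidates) {
  candidates[order(nchar(candidates), candidates)]
}

candidates <- c("hgnc_automatic_gene_name", "hgnc_symbol",
                "hgnc_curated_gene_name")
try_order <- shortest_first(candidates)
print(try_order)
```

Each name in try_order could then be tried in turn as the filter in getBM() until one returns rows.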
Best,
Jim
Tony Chiang wrote:
> Thanks Hans,
>
> That worked much better. Quick follow up question then (I guess for anyone
> who might know the answer), when would we use the hgnc gene names rather than
> the symbols? It would appear that ATF4 is a valid hgnc gene name so I
> thought that the obvious choice would have been to filter based on
> hgnc_automatic_gene_name but this is obviously not the case. I guess what I
> am trying to ask is how do I know what to use as the filter when it would
> seem like there is an obvious candidate to choose but it is not the correct one?
>
> Cheers,
> --Tony
>
>
> On Mon, Mar 1, 2010 at 12:31 AM, Hotz, Hans-Rudolf <hrh at fmi.ch> wrote:
>
>>
>>
>> On 2/28/10 7:16 PM, "Tony Chiang" <tchiang at fhcrc.org> wrote:
>>
>>> Hi Steffen et al,
>>>
>>> Quick question about a search query via biomaRt. Here is the code that I am
>>> using:
>>>
>>> *****
>>> library(biomaRt)
>>> ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
>>> filters = listFilters(ensembl)
>>> attributes = listAttributes(ensembl)
>>> getBM(attributes=c("ensembl_peptide_id", "entrezgene",
>>> "ensembl_gene_id", "hgnc_automatic_gene_name"),
>>> filters="hgnc_automatic_gene_name", values="ATF4",
>>> mart=ensembl)
>>> *****
>> try ' filters="hgnc_symbol" ', eg:
>>
>>
>> > getBM(attributes=c("ensembl_peptide_id", "entrezgene","ensembl_gene_id",
>>   "hgnc_automatic_gene_name"), filters="hgnc_symbol", values="ATF4",
>>   mart=ensembl)
>> ensembl_peptide_id entrezgene ensembl_gene_id hgnc_automatic_gene_name
>> 1 ENSP00000384587 468 ENSG00000128272 NA
>> 2 ENSP00000336790 468 ENSG00000128272 NA
>> 3 ENSP00000379912 468 ENSG00000128272 NA
>>
>>
>> Hans
>>
>>> For me, this returns an empty data frame. But when I query ATF4 online at
>>> ensembl, I find what I need. I also looked up ATF4 at genenames.org (HUGO)
>>> and it seems that ATF4 is a valid hgnc gene name, so the filter should be fine.
>>> I guess the only other reason that I can see is which dataset I use in the
>>> useMart function. I am guessing that the online API will search through all
>>> datasets while I am only specifying a single one? If this is true, do you
>>> know of a sensible workaround? I have about 150 genes that I would like
>>> mapped to the EBML ID names but using the code above with a vector of gene
>>> names, I can only map around 25...but if I manually query for some of the
>>> non-mapped gene names, I get what I am after. If I am wrong about my guess
>>> in the dataset, can you let me know what you think might be going on?
>>>
>>> Tony
>>>
>>>> sessionInfo()
>>> R version 2.11.0 Under development (unstable) (2010-01-16 r50993)
>>> i386-apple-darwin10.2.0
>>>
>>> locale:
>>> [1] en_US.utf-8/en_US.utf-8/C/C/en_US.utf-8/en_US.utf-8
>>>
>>> attached base packages:
>>> [1] grid stats graphics grDevices utils datasets methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] hgu133plus2.db_2.3.5 org.Hs.eg.db_2.3.6 Rgraphviz_1.25.1
>>> [4] biomaRt_2.3.0 GOstats_2.13.0 RSQLite_0.8-1
>>> [7] DBI_0.2-5 Category_2.13.0 AnnotationDbi_1.9.4
>>> [10] Biobase_2.7.3 RBGL_1.23.0 graph_1.25.5
>>>
>>> loaded via a namespace (and not attached):
>>> [1] annotate_1.25.1 genefilter_1.29.5 GO.db_2.3.5 GSEABase_1.9.0
>>> [5] RCurl_1.3-1 splines_2.11.0 survival_2.35-8 tools_2.11.0
>>> [9] XML_2.6-0 xtable_1.5-6
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826