[BioC] Search queries with biomaRt does not align with online queries via ensembl
Hotz, Hans-Rudolf
hrh at fmi.ch
Mon Mar 1 16:47:10 CET 2010
On 3/1/10 4:07 PM, "Tony Chiang" <tchiang at fhcrc.org> wrote:
> Thanks Hans,
>
> That worked much better. Quick follow up question then (I guess for anyone
> who might know the answer), when would we use the hgnc gene names rather than
> the symbols? It would appear that ATF4 is a valid hgnc gene name
As far as I understand, 'hgnc_symbol' should always work (if the symbol
exists). The HGNC assigns (or rather approves) 'symbols', whereas 'names'
refer to the written-out names, see:
http://www.genenames.org/data/hgnc_data.php?hgnc_id=786
Ensembl uses the HGNC symbol as 'Name', see:
http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000128272
=> notice the label 'curated'
Hence, for this particular symbol, you can also use the biomart filter
"hgnc_curated_gene_name", e.g.:
> getBM(attributes=c("ensembl_peptide_id", "entrezgene","ensembl_gene_id",
"hgnc_automatic_gene_name"), filters="hgnc_curated_gene_name", values="ATF4",
mart=ensembl)
  ensembl_peptide_id entrezgene ensembl_gene_id hgnc_automatic_gene_name
1    ENSP00000384587        468 ENSG00000128272                       NA
2    ENSP00000336790        468 ENSG00000128272                       NA
3    ENSP00000379912        468 ENSG00000128272                       NA
>
However, if you look at 'IGHA2', see:
http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000211890
-> notice the label 'automatic'
Hence, the biomart filter "hgnc_curated_gene_name" will not work, but the
biomart filter "hgnc_automatic_gene_name" will work, e.g.:
> getBM(attributes=c("ensembl_peptide_id", "entrezgene","ensembl_gene_id",
"hgnc_automatic_gene_name"), filters="hgnc_curated_gene_name", values="IGHA2",
mart=ensembl)
[1] ensembl_peptide_id entrezgene ensembl_gene_id
[4] hgnc_automatic_gene_name
<0 rows> (or 0-length row.names)
> getBM(attributes=c("ensembl_peptide_id", "entrezgene","ensembl_gene_id",
"hgnc_automatic_gene_name"), filters="hgnc_automatic_gene_name", values="IGHA2",
mart=ensembl)
  ensembl_peptide_id entrezgene ensembl_gene_id hgnc_automatic_gene_name
1    ENSP00000418606         NA ENSG00000211890                    IGHA2
2    ENSP00000374980         NA ENSG00000211890                    IGHA2
3    ENSP00000374981         NA ENSG00000211890                    IGHA2
>
And 'hgnc_symbol' always works, e.g.:
> getBM(attributes=c("ensembl_peptide_id", "entrezgene","ensembl_gene_id",
"hgnc_automatic_gene_name"), filters="hgnc_symbol", values="IGHA2",
mart=ensembl)
  ensembl_peptide_id entrezgene ensembl_gene_id hgnc_automatic_gene_name
1    ENSP00000418606         NA ENSG00000211890                    IGHA2
2    ENSP00000374980         NA ENSG00000211890                    IGHA2
3    ENSP00000374981         NA ENSG00000211890                    IGHA2
>
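If you have a whole vector of symbols (like your ~150 genes), it may help to check which ones came back empty after querying with 'hgnc_symbol'. Here is a minimal sketch, not run against the live mart: the helper name unmapped_symbols and the small example_result data frame are made up for illustration, and it assumes that "hgnc_symbol" is also requested as an attribute so the result has that column.

```r
# Report which requested symbols are missing from a getBM()-style result.
# Assumption: `result` has an "hgnc_symbol" column.
unmapped_symbols <- function(symbols, result) {
  setdiff(unique(symbols), unique(result$hgnc_symbol))
}

# Hypothetical stand-in for a getBM() result, so the helper can be tried offline.
example_result <- data.frame(
  ensembl_gene_id  = c("ENSG00000128272", "ENSG00000211890"),
  hgnc_symbol      = c("ATF4", "IGHA2"),
  stringsAsFactors = FALSE
)

# "NOTAGENE" is a deliberately invalid symbol used only for this illustration.
missing <- unmapped_symbols(c("ATF4", "IGHA2", "NOTAGENE"), example_result)
print(missing)

# With a live connection, the result would presumably come from something like:
# library(biomaRt)
# ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# result <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
#                 filters = "hgnc_symbol", values = my_symbols, mart = ensembl)
```

Whatever ends up in the missing vector is worth checking by hand at genenames.org, since it may be an alias or a withdrawn symbol rather than an approved one.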
Now, the follow-up question is: how does Ensembl distinguish between
'curated' and 'automatic'? Well, I am no longer fully familiar with Ensembl,
but I assume that the entry for IGHA2 has no support (not yet) from their
manual curators. There is also no link back to Vega on the HGNC web page
for 'IGHA2', whereas there is one for 'ATF4'.
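As for knowing which filter to pick in the first place: one option is to search the table returned by listFilters() for everything that mentions 'hgnc' and compare the candidates. A rough sketch follows. The helper find_filters is a made-up name, and demo_filters is only a hypothetical excerpt (the descriptions are invented) standing in for the real listFilters(ensembl) output, which has 'name' and 'description' columns.

```r
# Return the rows of a listFilters()-style table whose name or description
# matches `pattern` (case-insensitive).
find_filters <- function(filter_table, pattern) {
  hit <- grepl(pattern, filter_table$name, ignore.case = TRUE) |
    grepl(pattern, filter_table$description, ignore.case = TRUE)
  filter_table[hit, , drop = FALSE]
}

# Hypothetical excerpt standing in for listFilters(ensembl);
# the descriptions are illustrative, not copied from the mart.
demo_filters <- data.frame(
  name = c("chromosome_name", "hgnc_symbol",
           "hgnc_curated_gene_name", "hgnc_automatic_gene_name"),
  description = c("Chromosome name", "HGNC symbol",
                  "HGNC curated gene name", "HGNC automatic gene name"),
  stringsAsFactors = FALSE
)

hgnc_hits <- find_filters(demo_filters, "hgnc")
print(hgnc_hits$name)

# Against the live mart this would presumably be:
# find_filters(listFilters(ensembl), "hgnc")
```

This only narrows down the candidates; which of them matches a given gene still depends on how Ensembl labelled that entry, so 'hgnc_symbol' remains the safer default.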
I hope this clarifies the situation
Hans
> so I
> thought that the obvious choice would have been to filter based on
> hgnc_automatic_gene_name but this is obviously not the case. I guess what I
> am trying to ask is how do I know what to use as the filter when it would
> seem like there is an obvious candidate to chose but is not the correct one?
>
> Cheers,
> --Tony
>
>
> On Mon, Mar 1, 2010 at 12:31 AM, Hotz, Hans-Rudolf <hrh at fmi.ch> wrote:
>
>>
>>
>>
>> On 2/28/10 7:16 PM, "Tony Chiang" <tchiang at fhcrc.org> wrote:
>>
>>> Hi Steffen et al,
>>>
>>> Quick question about a search query via biomaRt. Here is the code that I
>> am
>>> using:
>>>
>>> *****
>>> library(biomaRt)
>>> ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
>>> filters = listFilters(ensembl)
>>> attributes = listAttributes(ensembl)
>>> getBM(attributes=c("ensembl_peptide_id", "entrezgene",
>>> "ensembl_gene_id", "hgnc_automatic_gene_name"),
>>> filters="hgnc_automatic_gene_name", values="ATF4",
>>> mart=ensembl)
>>> *****
>>
>> try ' filters="hgnc_symbol" ', eg:
>>
>>
>>> getBM(attributes=c("ensembl_peptide_id", "entrezgene","ensembl_gene_id",
>> "hgnc_automatic_gene_name"), filters="hgnc_symbol", values="ATF4",
>> mart=ensembl)
>>   ensembl_peptide_id entrezgene ensembl_gene_id hgnc_automatic_gene_name
>> 1    ENSP00000384587        468 ENSG00000128272                       NA
>> 2    ENSP00000336790        468 ENSG00000128272                       NA
>> 3    ENSP00000379912        468 ENSG00000128272                       NA
>>>
>>
>>
>>
>> Hans
>>
>>> For me, this returns an empty data frame. But when I query ATF4 online at
>>> ensembl, I find what I need. I also looked up ATF4 at genenames.org(HUGO)
>>> and it seems that ATF4 is a valid hgnc gene name, so the filter should be
>> fine.
>>> I guess the only other reason that I can see is which dataset I use in
>> the
>>> useMart function. I am guessing that the online API will search through
>> all
>>> datasets while I am only specifying a single one? If this is true, do you
>>> know of a sensible work around? I have about 150 genes that I would like
>>> mapped to the EBML ID names but using the code above with a vector of
>> gene
>>> names, I can only map around 25...but if I manually query for some of the
>>> non-mapped gene names, I get what I am after. If I am wrong about my
>> guess
>>> in the dataset, can you let me know what you think might be going on?
>>>
>>> Tony
>>>
>>>> sessionInfo()
>>> R version 2.11.0 Under development (unstable) (2010-01-16 r50993)
>>> i386-apple-darwin10.2.0
>>>
>>> locale:
>>> [1] en_US.utf-8/en_US.utf-8/C/C/en_US.utf-8/en_US.utf-8
>>>
>>> attached base packages:
>>> [1] grid stats graphics grDevices utils datasets methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] hgu133plus2.db_2.3.5 org.Hs.eg.db_2.3.6 Rgraphviz_1.25.1
>>> [4] biomaRt_2.3.0 GOstats_2.13.0 RSQLite_0.8-1
>>> [7] DBI_0.2-5 Category_2.13.0 AnnotationDbi_1.9.4
>>> [10] Biobase_2.7.3 RBGL_1.23.0 graph_1.25.5
>>>
>>> loaded via a namespace (and not attached):
>>> [1] annotate_1.25.1 genefilter_1.29.5 GO.db_2.3.5 GSEABase_1.9.0
>>> [5] RCurl_1.3-1 splines_2.11.0 survival_2.35-8 tools_2.11.0
>>> [9] XML_2.6-0 xtable_1.5-6
>>>