[BioC] getBM

Steffen at stat.Berkeley.EDU Steffen at stat.Berkeley.EDU
Tue Dec 8 18:09:00 CET 2009


Dear Andreia,

The result of a biomaRt query is indeed not returned in the order of the
input values.  The three identifiers you are querying do not map one-to-one
to each other, which partially explains why your result has more rows than
your input.  In addition, Ensembl annotates everything at the transcript
level: some transcript IDs of a gene are mapped to an hgnc_symbol or
entrezgene ID and others are not, which is why you get repeated rows and NAs.

For example in your result you have:

>   ensembl_gene_id entrezgene hgnc_symbol
> 1 ENSG00000198692       9086      EIF1AY
> 2 ENSG00000198692         NA      EIF1AY

This looks like repeated information with a spurious NA, but if you add
ensembl_transcript_id to your query you get:

getBM(attributes = c("ensembl_gene_id", "ensembl_transcript_id",
                     "entrezgene", "hgnc_symbol"),
      filters = "ensembl_gene_id", values = "ENSG00000198692", mart = human)
  ensembl_gene_id ensembl_transcript_id entrezgene hgnc_symbol
1 ENSG00000198692       ENST00000382772         NA      EIF1AY
2 ENSG00000198692       ENST00000361365       9086      EIF1AY


As you can see, transcript ENST00000382772 is not associated with
entrezgene ID 9086, but transcript ENST00000361365 of the same gene is.
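
To make the mechanism concrete, here is a small sketch that rebuilds the
transcript-level result above by hand (no query is sent; the data frame
name res is just an assumption for illustration) and then drops the
transcript column.  The two rows stay distinct because the entrezgene
values differ, which is exactly the pair of rows in your output:

```r
# Hand-built copy of the transcript-level result shown above
res <- data.frame(
  ensembl_gene_id       = c("ENSG00000198692", "ENSG00000198692"),
  ensembl_transcript_id = c("ENST00000382772", "ENST00000361365"),
  entrezgene            = c(NA, 9086),
  hgnc_symbol           = c("EIF1AY", "EIF1AY"),
  stringsAsFactors      = FALSE
)

# Dropping the transcript column does not collapse the rows:
# one row carries entrezgene NA, the other 9086
gene_level <- unique(res[, c("ensembl_gene_id", "entrezgene", "hgnc_symbol")])
print(gene_level)
```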

To avoid getting NAs and duplication, I would do your query in two steps
and combine the results in R.

1) get a map from ensembl_gene_id to entrezgene

map1 = getBM(attributes = c("ensembl_gene_id", "entrezgene"),
             filters = c("with_entrezgene", "ensembl_gene_id"),
             values = list(TRUE, dataHT[,1]), mart = human)

2) get a map from ensembl_gene_id to hgnc_symbol

map2 = getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
             filters = c("with_hgnc", "ensembl_gene_id"),
             values = list(TRUE, dataHT[,1]), mart = human)
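
One possible way to combine the two maps and restore the order of your
input is sketched below with base R merge() and match().  The small map1
and map2 data frames are toy stand-ins for the query results (values taken
from this thread), and ids stands in for dataHT[,1]; the missing symbol for
the first ID is a deliberate omission to show what all = TRUE does, not a
statement about Ensembl:

```r
# Toy stand-in for dataHT[,1]: the input IDs, in the order you want back
ids <- c("ENSG00000000003", "ENSG00000198692", "ENSG00000101557")

# Toy stand-ins for the two getBM() results (rows arrive in arbitrary order)
map1 <- data.frame(
  ensembl_gene_id  = c("ENSG00000198692", "ENSG00000101557", "ENSG00000000003"),
  entrezgene       = c(9086, 9097, 7105),
  stringsAsFactors = FALSE
)
map2 <- data.frame(
  ensembl_gene_id  = c("ENSG00000101557", "ENSG00000198692"),  # first ID left out on purpose
  hgnc_symbol      = c("USP14", "EIF1AY"),
  stringsAsFactors = FALSE
)

# all = TRUE keeps genes that appear in only one of the two maps (NA elsewhere)
combined <- merge(map1, map2, by = "ensembl_gene_id", all = TRUE)

# Reorder the rows to follow the input vector
combined <- combined[order(match(combined$ensembl_gene_id, ids)), ]
rownames(combined) <- NULL
print(combined)
```

Note that a gene mapped to more than one entrezgene ID or symbol will still
give more than one row after the merge.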

Cheers,
Steffen

> Dear Forum,
>
> I am trying to get the entrezgene information for a list of
> ensembl_gene_ids. The command that I am using is
>
> test3 <- getBM(attributes = c("ensembl_gene_id", "entrezgene", "hgnc_symbol"),
>                filters = "ensembl_gene_id", values = dataHT[,1], mart = human)
>
> The list dataHT[,1] has 10,987 unique IDs, and the first ensembl_gene_id is
> ENSG00000000003, which corresponds to entrezgene 7105. The result has 18533
> rows and the first value is not 7105, so the result is not in the order of
> the filter vector. I am also getting too many hits, with doubled lines
> carrying almost the same information (see result below). Can someone help
> me with this?
>
> head(test3)
>   ensembl_gene_id entrezgene hgnc_symbol
> 1 ENSG00000198692       9086      EIF1AY
> 2 ENSG00000198692         NA      EIF1AY
> 3 ENSG00000101557       9097       USP14
> 4 ENSG00000079134       9984       THOC1
> 5 ENSG00000158270      81035     COLEC12
> 6 ENSG00000079101      27098       CLUL1
>
> Thanks
> Andreia