[BioC] biomart query: ensembl gene id and entrez gene id confusion

Tue Aug 23 15:39:55 CEST 2011

Hi Natasha,

On 8/23/2011 8:37 AM, Natasha Sahgal wrote:
> Dear List,
>
> I want to extract ensembl gene ids from biomart to add to my
> microarray analysis output. However, there are some discrepancies
> that have me confused regarding the entrez gene id and ensembl gene
> id.
>
> Array used: Illumina HumanHT12 v4.
>
>
> As an example: GAGE12F, GAGE12G, GAGE12I genes
>
> Microarray: Illumina HT12 v4 output:
>
> Entrez_Gene_ID  Symbol   Chromosome  Probe_Id      Probe_Type  Cytoband  Definition
> 26748           GAGE12I  X           ILMN_1691563  A           Xp11.23b  Homo sapiens G antigen 12I (GAGE12I), mRNA.
> 100008586       GAGE12F  X           ILMN_3242920  S           Xp11.23b  Homo sapiens G antigen 12F (GAGE12F), mRNA.
> 645073          GAGE12G  X           ILMN_1664660  S           Xp11.23b  Homo sapiens G antigen 12G (GAGE12G), mRNA.
>
> Biomart output:
>
>      entrezgene ensembl_gene_id hgnc_symbol
> 1     100008586 ENSG00000241465     GAGE12I
> 2     100008586 ENSG00000236362     GAGE12F
> 3     100008586 ENSG00000215269     GAGE12G
> 1022      26748 ENSG00000241465     GAGE12I
> 1023      26748 ENSG00000236362     GAGE12F
> 1024      26748 ENSG00000215269     GAGE12G
> 2392     645073 ENSG00000241465     GAGE12I
> 2393     645073 ENSG00000236362     GAGE12F
> 2394     645073 ENSG00000215269     GAGE12G
>
> So please help me understand, why are there multiple results rather
> than true unique results. If I merge the two, based on the above, I
> would get an incorrectly merged file. (I cannot use the Illumina HT12
> probe ids as a filter, as I was informed that in biomart these are
> mapped to the HT12 v3 chip).

This has to do with the difference between manufacturer mappings and 
annotations of genes. When the manufacturer creates a chip, they intend 
each reporter to interrogate a single transcript, and it is not unheard 
of for them to ignore other cross-hybridizing transcripts.

On the other hand, the annotation of genes, and especially the 
cross-listing between annotation databases, cannot be so single-minded. 
In the case of the GAGE genes, there are multiple transcripts with the 
same name, and multiple genomic positions for each transcript, which is 
why each of your three Entrez Gene IDs comes back cross-referenced to 
all three Ensembl gene IDs. If you look at the UCSC Genome Browser at 
this position:

http://genome.ucsc.edu/cgi-bin/hgTracks?hgsid=208507527&hgt_doJsCommand=&position=chrX%3A49%2C323%2C231-49%2C333%2C657&hgtgroup_map_close=0&hgtgroup_phenDis_close=1&hgtgroup_genes_close=0&hgtgroup_rna_close=0&hgtgroup_expression_close=1&hgtgroup_regulation_close=1&hgtgroup_compGeno_close=0&hgtgroup_neandertal_close=0&hgtgroup_varRep_close=0

you can see that there are transcripts labeled GAGE12C, D, E, F, G, and 
I that are identical but for the UTR regions. I am actually surprised 
that the Biomart server didn't return these other GAGE12 transcripts as 
well.
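
As a rough illustration of what this does to a merge, here is a small sketch that rebuilds the two tables from your post by hand (no Biomart query involved) and counts the cross-references. The object names (illumina, bm, by_id, by_id_sym) are made up for the example, and joining on the symbol as well as the Entrez ID is only one possible workaround. It assumes the Illumina symbol agrees with the current HGNC symbol, and it does nothing about the underlying cross-hybridization.

```r
# Illumina annotation for the three probes (values copied from the post).
# IDs are kept as character strings so they are compared exactly.
illumina <- data.frame(
  Entrez_Gene_ID = c("26748", "100008586", "645073"),
  Symbol         = c("GAGE12I", "GAGE12F", "GAGE12G"),
  Probe_Id       = c("ILMN_1691563", "ILMN_3242920", "ILMN_1664660"),
  stringsAsFactors = FALSE
)

# Biomart result for the same Entrez IDs (values copied from the post):
# every Entrez ID is listed against all three Ensembl genes.
ens <- c("ENSG00000241465", "ENSG00000236362", "ENSG00000215269")
sym <- c("GAGE12I", "GAGE12F", "GAGE12G")
bm <- data.frame(
  entrezgene      = rep(c("100008586", "26748", "645073"), each = 3),
  ensembl_gene_id = rep(ens, times = 3),
  hgnc_symbol     = rep(sym, times = 3),
  stringsAsFactors = FALSE
)

# Number of distinct Ensembl genes per Entrez ID; anything > 1 is ambiguous
n_ens <- tapply(bm$ensembl_gene_id, bm$entrezgene,
                function(x) length(unique(x)))
ambiguous <- names(n_ens)[n_ens > 1]

# Merging on the Entrez ID alone multiplies the rows
by_id <- merge(illumina, bm,
               by.x = "Entrez_Gene_ID", by.y = "entrezgene")

# Merging on Entrez ID plus symbol keeps one row per probe here
# (an assumption-laden workaround, not a general fix)
by_id_sym <- merge(illumina, bm,
                   by.x = c("Entrez_Gene_ID", "Symbol"),
                   by.y = c("entrezgene", "hgnc_symbol"))

nrow(by_id)      # rows after merging on Entrez ID only
nrow(by_id_sym)  # rows after merging on Entrez ID and symbol
```

On your full result, the same tapply() count would tell you how many Entrez IDs are affected before you decide how to handle them.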

Best,

Jim

>
> R code and sessionInfo:
>
> library(biomaRt)
> library(DESeq)
> library(gdata)
>
> m_h_a2  # 3995  15 (limma output for a given comparison)
>
> length(unique(m_h_a2$Entrez_Gene_ID)) # 3506
> length(unique(m_h_a2$Symbol))         # 3542
> length(unique(m_h_a2$Probe_Id))       # 3995
>
> ## Non-NA's
> mh.ona = na.omit(m_h_a2)    # 3912   17
>
> ## Unique ids
> mh.u.eg = m_h_a2[match(unique(m_h_a2$Entrez_Gene_ID), m_h_a2$Entrez_Gene_ID),] # 3506   15
> mh.u.eg = na.omit(mh.u.eg) # 3505   15
>
> ensembl = useMart("ensembl", dataset="hsapiens_gene_ensembl")
>
> mh_eg.ens <- getBM(attributes = c("entrezgene","ensembl_gene_id","hgnc_symbol"),
>                    filters = "entrezgene",
>                    values = mh.u.eg$Entrez_Gene_ID,
>                    mart = ensembl) # 3305   3
>
> ### I would like to merge mh.u.eg  with mh_eg.ens
>
> ## sessionInfo
> R version 2.13.0 (2011-04-13)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_GB.UTF-8
>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] scatterplot3d_0.3-33 WriteXLS_2.1.0       gdata_2.8.2
> [4] DESeq_1.4.1          locfit_1.5-6         lattice_0.19-23
> [7] akima_0.5-4          Biobase_2.12.1       biomaRt_2.8.0
>
> loaded via a namespace (and not attached):
>  [1] annotate_1.30.0      AnnotationDbi_1.14.1 DBI_0.2-5
>  [4] genefilter_1.34.0    geneplotter_1.30.0   grid_2.13.0
>  [7] gtools_2.6.2         RColorBrewer_1.0-2   RCurl_1.6-4
> [10] RSQLite_0.9-4        splines_2.13.0       survival_2.36-5
> [13] tools_2.13.0         XML_3.4-0            xtable_1.5-6
>
>
> Many Thanks, Natasha

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
**********************************************************
Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues