[BioC] biomart query: ensembl gene id and entrez gene id confusion
James W. MacDonald
jmacdon at med.umich.edu
Tue Aug 23 15:39:55 CEST 2011
Hi Natasha,
On 8/23/2011 8:37 AM, Natasha Sahgal wrote:
> Dear List,
>
> I want to extract ensembl gene ids from biomart to add to my
> microarray analysis output. However, there are some discrepancies
> that have me confused regarding the entrez gene id and ensembl gene
> id.
>
> Array used: Illumina HumanHT12 v4.
>
>
> As an example: GAGE12F, GAGE12G, GAGE12I genes
>
> Microarray: Illumina HT12 v4 output:
>
> Entrez_Gene_ID  Symbol   Chromosome  Probe_Id      Probe_Type  Cytoband
> 26748           GAGE12I  X           ILMN_1691563  A           Xp11.23b
> 100008586       GAGE12F  X           ILMN_3242920  S           Xp11.23b
> 645073          GAGE12G  X           ILMN_1664660  S           Xp11.23b
>
> Definition
> Homo sapiens G antigen 12I (GAGE12I), mRNA.
> Homo sapiens G antigen 12F (GAGE12F), mRNA.
> Homo sapiens G antigen 12G (GAGE12G), mRNA.
>
> Biomart output:
>
>       entrezgene  ensembl_gene_id  hgnc_symbol
> 1      100008586  ENSG00000241465  GAGE12I
> 2      100008586  ENSG00000236362  GAGE12F
> 3      100008586  ENSG00000215269  GAGE12G
> 1022       26748  ENSG00000241465  GAGE12I
> 1023       26748  ENSG00000236362  GAGE12F
> 1024       26748  ENSG00000215269  GAGE12G
> 2392      645073  ENSG00000241465  GAGE12I
> 2393      645073  ENSG00000236362  GAGE12F
> 2394      645073  ENSG00000215269  GAGE12G
>
> So please help me understand: why are there multiple results rather
> than truly unique ones? If I merge the two, based on the above, I
> would get an incorrectly merged file. (I cannot use the Illumina HT12
> probe ids as a filter, as I was informed that in biomart these are
> mapped to the HT12 v3 chip.)
This has to do with the difference between manufacturer mappings and
annotations of genes. When the manufacturer creates a chip, they intend
each reporter to interrogate a single transcript, and it is not unheard
of for them to ignore other cross-hybridizing transcripts.
On the other hand, the annotation of genes, and especially cross-listing
between annotation databases, cannot be so single-minded. In the case of
the GAGE genes, there are multiple transcripts with the same name, and
multiple genomic positions for each transcript. If you look at the UCSC
Genome Browser at this position:
http://genome.ucsc.edu/cgi-bin/hgTracks?hgsid=208507527&hgt_doJsCommand=&position=chrX%3A49%2C323%2C231-49%2C333%2C657&hgtgroup_map_close=0&hgtgroup_phenDis_close=1&hgtgroup_genes_close=0&hgtgroup_rna_close=0&hgtgroup_expression_close=1&hgtgroup_regulation_close=1&hgtgroup_compGeno_close=0&hgtgroup_neandertal_close=0&hgtgroup_varRep_close=0
you can see that there are transcripts labeled GAGE12C, D, E, F, G, and
I that are identical but for the UTR regions. I am actually surprised
that the Biomart server didn't return these other GAGE12 transcripts as
well.
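
To make the many-to-many mapping concrete, here is a small sketch in base R that uses only the rows quoted in this thread (no Biomart query is made). The object names `illumina`, `biomart`, `naive` and `strict` are made up for illustration. Requiring the symbol to agree as well as the Entrez Gene ID is just one possible heuristic, not a general fix: it assumes the Illumina `Symbol` and the Biomart `hgnc_symbol` use the same nomenclature, which will not hold for every gene.

```r
# Toy data copied from the tables quoted in this thread.
# IDs are kept as character strings so they are compared exactly.
illumina <- data.frame(
  Entrez_Gene_ID = c("26748", "100008586", "645073"),
  Symbol         = c("GAGE12I", "GAGE12F", "GAGE12G"),
  Probe_Id       = c("ILMN_1691563", "ILMN_3242920", "ILMN_1664660"),
  stringsAsFactors = FALSE
)

biomart <- data.frame(
  entrezgene      = rep(c("100008586", "26748", "645073"), each = 3),
  ensembl_gene_id = rep(c("ENSG00000241465", "ENSG00000236362",
                          "ENSG00000215269"), times = 3),
  hgnc_symbol     = rep(c("GAGE12I", "GAGE12F", "GAGE12G"), times = 3),
  stringsAsFactors = FALSE
)

# How many distinct Ensembl gene ids does each Entrez Gene ID map to?
n_per_entrez <- tapply(biomart$ensembl_gene_id, biomart$entrezgene,
                       function(x) length(unique(x)))

# Merging on the Entrez Gene ID alone keeps every cross-listed pair,
# so each probe is repeated once per Ensembl gene id.
naive <- merge(illumina, biomart,
               by.x = "Entrez_Gene_ID", by.y = "entrezgene")

# Heuristic (an assumption, see above): also require the symbols to agree.
strict <- merge(illumina, biomart,
                by.x = c("Entrez_Gene_ID", "Symbol"),
                by.y = c("entrezgene", "hgnc_symbol"))
```

Comparing `nrow(naive)` with `nrow(strict)` shows how much of the duplication comes from the cross-listing, and `n_per_entrez` flags the Entrez Gene IDs that need a closer look before any merge.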
Best,
Jim
>
> R code and sessionInfo:
>
> library(biomaRt)
> library(DESeq)
> library(gdata)
>
> dim(m_h_a2) # 3995 15 (limma output for a given comparison)
>
> length(unique(m_h_a2$Entrez_Gene_ID)) # 3506
> length(unique(m_h_a2$Symbol)) # 3542
> length(unique(m_h_a2$Probe_Id)) # 3995
>
> ## Non-NA's
> mh.ona = na.omit(m_h_a2) # 3912 17
>
> ## Unique ids
> mh.u.eg = m_h_a2[match(unique(m_h_a2$Entrez_Gene_ID),
>                        m_h_a2$Entrez_Gene_ID), ] # 3506 15
> mh.u.eg = na.omit(mh.u.eg) # 3505 15
>
> ensembl = useMart("ensembl", dataset="hsapiens_gene_ensembl")
>
> mh_eg.ens <- getBM(attributes = c("entrezgene", "ensembl_gene_id",
>                                   "hgnc_symbol"),
>                    filters = "entrezgene",
>                    values = mh.u.eg$Entrez_Gene_ID,
>                    mart = ensembl) # 3305 3
>
> ### I would like to merge mh.u.eg with mh_eg.ens
>
> ## sessionInfo
> R version 2.13.0 (2011-04-13)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_GB.UTF-8
>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] scatterplot3d_0.3-33 WriteXLS_2.1.0       gdata_2.8.2
> [4] DESeq_1.4.1          locfit_1.5-6         lattice_0.19-23
> [7] akima_0.5-4          Biobase_2.12.1       biomaRt_2.8.0
>
> loaded via a namespace (and not attached):
>  [1] annotate_1.30.0      AnnotationDbi_1.14.1 DBI_0.2-5
>  [4] genefilter_1.34.0    geneplotter_1.30.0   grid_2.13.0
>  [7] gtools_2.6.2         RColorBrewer_1.0-2   RCurl_1.6-4
> [10] RSQLite_0.9-4        splines_2.13.0       survival_2.36-5
> [13] tools_2.13.0         XML_3.4-0            xtable_1.5-6
>
>
> Many Thanks, Natasha
--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826