[BioC] mis-matched gene symbols and entrez ID in biomaRt

Iain Gallagher iaingallagher at btopenworld.com
Wed Sep 7 10:53:30 CEST 2011


Hi Wendy

The version of Ensembl you're using (51) is very old. The current release is 63.

Identifiers (even Entrez / Ensembl IDs) come and go as knowledge of the genome changes. Some identifiers stay in the databases but get tagged as 'retired'. If the number of mismatches is low you can sort this out manually using the web-based Entrez Gene query system. If it's a large number then an annotation package such as org.Hs.eg.db may help (although you may lose a few genes because of retired IDs), e.g.

library(org.Hs.eg.db)
symbols <- c('BAGE2', 'BAGE3', 'BAGE4', 'BAGE5')
EGIDS <- unlist(mget(symbols, org.Hs.egSYMBOL2EG, ifnotfound = NA))
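To judge whether the number of mismatches is small enough to fix by hand, one option is to count how many symbols map to more than one Entrez ID (and vice versa). The following is only a sketch in base R: the data frame is a toy copy of a few rows from the getBM() output quoted below, and the names mapped, idsPerSymbol, symbolsPerId, ambiguousSymbols and ambiguousIds are made up for illustration.

```r
# Toy copy of a few rows from the getBM() output quoted below
Entrez <- data.frame(
  hgnc_symbol = c("HMX1", "HMX1", "BAGE5", "BAGE4", "BAGE5", "BAGE4", "NBR1"),
  entrezgene  = c(NA, 3166, 85316, 85316, 85317, 85317, 4077),
  stringsAsFactors = FALSE
)

# Drop rows with no Entrez ID, then remove exact duplicate pairs
mapped <- unique(Entrez[!is.na(Entrez$entrezgene), ])

# Number of distinct Entrez IDs per symbol, and of symbols per Entrez ID
idsPerSymbol <- tapply(mapped$entrezgene, mapped$hgnc_symbol,
                       function(x) length(unique(x)))
symbolsPerId <- tapply(mapped$hgnc_symbol, mapped$entrezgene,
                       function(x) length(unique(x)))

# Anything with a count above 1 is ambiguous and needs a closer look
ambiguousSymbols <- names(idsPerSymbol)[idsPerSymbol > 1]
ambiguousIds     <- names(symbolsPerId)[symbolsPerId > 1]
```

On the toy rows this flags BAGE4 and BAGE5 as ambiguous symbols and 85316 and 85317 as ambiguous IDs, while HMX1 drops out once its NA row is removed.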

Another handy package is limma which has the alias2SymbolTable function so you can convert your list of symbols (which may contain a mixture of official symbols and 'alias' symbols) to official symbols only:

e.g. 
library(limma)
symbols <- c('BAGE2', 'BAGE3', 'BAGE4', 'BAGE5')
symbolsOfficial <- alias2SymbolTable(symbols, species = "Hs")

Note that this example may just return the same symbols... I haven't run the code.

You might want to run this before using the org.Hs.eg.db package above to make sure all your symbols are official.
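Chaining the two steps might look something like the sketch below. Again I haven't run it, so treat it as an outline rather than tested code; the names official, keep, egIds and multi are just for illustration, and it assumes both limma and org.Hs.eg.db are installed.

```r
library(limma)         # alias2SymbolTable()
library(org.Hs.eg.db)  # org.Hs.egSYMBOL2EG

symbols <- c('BAGE2', 'BAGE3', 'BAGE4', 'BAGE5')

# Step 1: convert aliases to official symbols (NA where nothing matches)
official <- alias2SymbolTable(symbols, species = "Hs")

# Step 2: look up Entrez IDs for the distinct symbols that were recognised
keep  <- unique(official[!is.na(official)])
egIds <- mget(keep, org.Hs.egSYMBOL2EG, ifnotfound = NA)

# Symbols that still map to more than one Entrez ID need a manual check
multi <- names(egIds)[sapply(egIds, length) > 1]
```

Dropping the NA entries before the lookup keeps unrecognised symbols out of mget(), and whatever ends up in multi is the short list to check by hand in Entrez Gene.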

Best

iain

--- On Wed, 7/9/11, Wendy Qiao <wendy2.qiao at gmail.com> wrote:

> From: Wendy Qiao <wendy2.qiao at gmail.com>
> Subject: [BioC] mis-matched gene symbols and entrez ID in biomaRt
> To: bioconductor at r-project.org
> Date: Wednesday, 7 September, 2011, 4:24
> Hi all,
> 
> I am converting the HGNC symbols from an Illumina human array to Entrez IDs
> using biomaRt. I found that some gene symbols are matched to many Entrez
> IDs, and vice versa. I am wondering how to solve the problem, so that one
> gene symbol is matched to only one Entrez ID. Or is there any other package
> that I can use for matching gene symbols to Entrez IDs? Thank you in
> advance.
> 
> Wendy
> 
> =====
> In the following example, BAGE2, 3, 4 and 5 are matched to
> 85316 and 85317
> which are the Entrez IDs of BAGE5 and BAGE4, respectively.
> 
> library('biomaRt')
> ensembl=useMart("ensembl_mart_51",dataset="hsapiens_gene_ensembl",archive=TRUE)
> Entrez<-getBM(attributes=c("hgnc_symbol","entrezgene"),filters="hgnc_symbol",values=GeneList,mart=ensembl)
> # class(GeneList) = factor
> 
> Entrez[1:20,]
>    hgnc_symbol entrezgene
> 1        ZFP62      92379
> 2     C9orf169     375791
> 3       FAM72D     653573
> 4         HMX1         NA
> 5         HMX1       3166
> 6        ZFP62         NA
> 7        RSPO4     343637
> 8        DOC2B       8447
> 9      C8orf42     157695
> 10       TTTY8         NA
> 11       A26C3         NA
> 12       BAGE5      85316
> 13       BAGE4      85316
> 14       BAGE3      85316
> 15       BAGE2      85316
> 16       BAGE5      85317
> 17       BAGE4      85317
> 18       BAGE3      85317
> 19       BAGE2      85317
> 20        NBR1       4077