[BioC] BiomaRt Ensembl RefSeq query error

Thu Jan 23 04:15:56 CET 2014

Georg,
Using your code to query only "ENSMUSG00000000567" does not return NA for me, as you can see:

 library(biomaRt)
 ensembl <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
 getBM(attributes = c("ensembl_gene_id", "refseq_mrna"),
       filters = "ensembl_gene_id", values = "ENSMUSG00000000567",
       mart = ensembl, uniqueRows = TRUE)

     ensembl_gene_id refseq_mrna
1 ENSMUSG00000000567   NM_011448

You are running R 3.0.1 just like me, but your biomaRt is 2.18 (I'm running 2.16, see below). biomaRt 2.18 is part of BioC 2.13, which is meant for R 3.0.2 as noted here:
http://www.bioconductor.org/install/

That mismatch between your R version and your biomaRt version is the most likely cause.
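
To check for this kind of mismatch quickly, something like the sketch below may help. It compares the running R version against a minimum and reports the installed biomaRt version. The helper name `meets_min_r` is made up for illustration, and the "3.0.2" threshold is simply the R version mentioned above for BioC 2.13, not something the code looks up.

```r
# Hypothetical helper: TRUE if `current` is at least `min_version`.
meets_min_r <- function(min_version, current = getRversion()) {
  current >= min_version
}

# Report the running R version.
cat("R version:", as.character(getRversion()), "\n")

# Report the installed biomaRt version, if the package is available.
if (requireNamespace("biomaRt", quietly = TRUE)) {
  cat("biomaRt version:", as.character(utils::packageVersion("biomaRt")), "\n")
} else {
  cat("biomaRt is not installed\n")
}

# Assumption: 3.0.2 is the R version that BioC 2.13 (biomaRt 2.18) targets.
if (!meets_min_r("3.0.2")) {
  message("R is older than 3.0.2; packages from BioC 2.13 may not match it.")
}
```

If the message appears, updating R (or installing the Bioconductor release that matches the running R) would be the first thing to try.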

Wade

sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomaRt_2.16.0

loaded via a namespace (and not attached):
[1] RCurl_1.95-4.1 XML_3.98-1.1  

-----Original Message-----
From: Georg Otto [mailto:georg.otto at imm.ox.ac.uk] 
Sent: Tuesday, January 21, 2014 6:49 AM
To: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] BiomaRt Ensembl RefSeq query error

As an amendment to my previous post, here is the sessionInfo():

R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomaRt_2.18.0

loaded via a namespace (and not attached):
 [1] annotate_1.40.0      AnnotationDbi_1.24.0 Biobase_2.22.0      
 [4] BiocGenerics_0.8.0   compiler_3.0.1       DBI_0.2-7           
 [7] DESeq_1.14.0         genefilter_1.44.0    geneplotter_1.40.0  
[10] grid_3.0.1           IRanges_1.20.6       lattice_0.20-24     
[13] parallel_3.0.1       RColorBrewer_1.0-5   RCurl_1.95-4.1      
[16] RSQLite_0.11.4       splines_3.0.1        stats4_3.0.1        
[19] survival_2.37-4      tools_3.0.1          XML_3.98-1.1        
[22] xtable_1.7-1        

Georg Otto <georg.otto at imm.ox.ac.uk> writes:

> Dear Bioconductors,
>
> I am trying to query 14005 Ensembl gene IDs for their RefSeq 
> annotations using this code (I can send the gene IDs upon request):
>
> ensembl <- useMart("ensembl", dataset = 'mmusculus_gene_ensembl')
>
> getBM(attributes = c("ensembl_gene_id",
>                       "refseq_mrna"), filter="ensembl_gene_id",
>                     ensembl.ids,
>                     mart = ensembl, uniqueRows = TRUE)
>
>
> If I query for the full gene set, many RefSeq IDs are missing (NA), 
> for example for the gene ENSMUSG00000000567 (Sox9), whereas if I query 
> for a subset, say ensembl.ids[1:12000], all the RefSeq IDs are there. 
> It does not seem to matter which subset I use, but the size of the 
> subset has to be smaller than ca. 12000 genes.
>
> Any idea what is going on?
>
> Best wishes,
>
> Georg
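
Since the original report says that subsets of fewer than about 12000 IDs return complete results, one possible workaround is to query in smaller batches and combine the results. The sketch below is only an illustration under that assumption. The helper names (`split_into_batches`, `query_in_batches`, `fetch_refseq`) and the batch size of 5000 are invented here, and nothing in the thread confirms that batching addresses the underlying cause.

```r
# Split a vector of IDs into batches of at most `batch_size` elements.
split_into_batches <- function(ids, batch_size = 5000) {
  split(ids, ceiling(seq_along(ids) / batch_size))
}

# Apply `query_fun` to each batch and row-bind the resulting data frames,
# dropping rows that are duplicated across batches.
query_in_batches <- function(ids, query_fun, batch_size = 5000) {
  batches <- unname(split_into_batches(ids, batch_size))
  results <- lapply(batches, query_fun)
  unique(do.call(rbind, results))
}

# Hypothetical wrapper around the query from the thread. It needs biomaRt
# and a network connection, so it is defined here but not called.
fetch_refseq <- function(ids, mart, batch_size = 5000) {
  query_in_batches(ids, function(batch) {
    biomaRt::getBM(attributes = c("ensembl_gene_id", "refseq_mrna"),
                   filters = "ensembl_gene_id", values = batch,
                   mart = mart, uniqueRows = TRUE)
  }, batch_size = batch_size)
}
```

With the `ensembl` mart and `ensembl.ids` vector from the original post, the call would be `fetch_refseq(ensembl.ids, ensembl)`.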