[BioC] BiomaRt Ensembl RefSeq query error

Georg Otto georg.otto at imm.ox.ac.uk
Thu Jan 23 19:34:10 CET 2014


Dear Wade,

Thank you very much for your response. I reverted my installation to
biomaRt_2.16.0, but the problem persists. I have added some example code
below that supports the conclusion that the problem is caused by the
refseq query and by the number of genes queried, not by specific genes.

It concerns not only the sox9 gene, but a lot of other genes too.

Unfortunately I cannot send attachments here, so I will gladly send the
file with the Ensembl gene IDs upon request.

Best wishes,

Georg


library(biomaRt)

ensembl <- useMart("ensembl", dataset = 'mmusculus_gene_ensembl')

ensembl.id <- read.table(file = "ensembl-id.txt")

## the sox9 gene is at position 68
which(ensembl.id[,1] == "ENSMUSG00000000567")
## [1] 68

## query the first 1000 genes
ensembl.df <- getBM(attributes = c("ensembl_gene_id", "refseq_mrna",
                                   "mgi_symbol", "description"),
                    filters = "ensembl_gene_id",
                    values = ensembl.id[1:1000, 1],
                    mart = ensembl, uniqueRows = TRUE)

## sox9 is there
ensembl.df[which(ensembl.df$ensembl_gene_id == "ENSMUSG00000000567"),]

##        ensembl_gene_id refseq_mrna mgi_symbol
## 141 ENSMUSG00000000567   NM_011448       Sox9
##                                                     description
## 141 SRY-box containing gene 9 [Source:MGI Symbol;Acc:MGI:98371]



## query all the genes
ensembl.df <- getBM(attributes = c("ensembl_gene_id", "refseq_mrna",
                                   "mgi_symbol", "description"),
                    filters = "ensembl_gene_id", values = ensembl.id,
                    mart = ensembl, uniqueRows = TRUE)

## sox9 is missing
ensembl.df[which(ensembl.df$ensembl_gene_id == "ENSMUSG00000000567"),]
## [1] ensembl_gene_id refseq_mrna     mgi_symbol      description
## <0 rows> (or 0-length row.names)
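
To list every ID that drops out of a result, rather than checking sox9
alone, something like the following could be used. This is only a
sketch: missing_ids is a hypothetical helper, not a biomaRt function,
and the small data frame is a toy stand-in for a real getBM() result.
An ID can also be absent for reasons unrelated to this problem, so the
list is a starting point rather than proof.

```r
## Hypothetical helper (not part of biomaRt): return the requested IDs
## that do not appear in the ID column of a result data frame.
missing_ids <- function(requested, result, id_col = "ensembl_gene_id") {
  requested <- unique(as.character(requested))
  returned  <- unique(as.character(result[[id_col]]))
  setdiff(requested, returned)
}

## Toy stand-in for a getBM() result in which one ID was dropped
requested <- c("ENSMUSG00000000001", "ENSMUSG00000000567",
               "ENSMUSG00000000028")
toy.df <- data.frame(ensembl_gene_id = c("ENSMUSG00000000001",
                                         "ENSMUSG00000000028"),
                     stringsAsFactors = FALSE)
dropped <- missing_ids(requested, toy.df)
print(dropped)
```

With the real data the call would be missing_ids(ensembl.id[, 1],
ensembl.df).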


## genes 1:12262, sox9 is included
ensembl.df <- getBM(attributes = c("ensembl_gene_id", "refseq_mrna",
                                   "mgi_symbol", "description"),
                    filters = "ensembl_gene_id",
                    values = ensembl.id[1:12262,],
                    mart = ensembl, uniqueRows = TRUE)

ensembl.df[which(ensembl.df$ensembl_gene_id == "ENSMUSG00000000567"),]

##        ensembl_gene_id refseq_mrna mgi_symbol
## 141 ENSMUSG00000000567   NM_011448       Sox9
##                                                     description
## 141 SRY-box containing gene 9 [Source:MGI Symbol;Acc:MGI:98371]


## genes 1:12263, sox9 is not included
ensembl.df <- getBM(attributes = c("ensembl_gene_id", "refseq_mrna",
                                   "mgi_symbol", "description"),
                    filters = "ensembl_gene_id",
                    values = ensembl.id[1:12263,],
                    mart = ensembl, uniqueRows = TRUE)

ensembl.df[which(ensembl.df$ensembl_gene_id == "ENSMUSG00000000567"),]

## [1] ensembl_gene_id refseq_mrna     mgi_symbol      description    
## <0 rows> (or 0-length row.names)


## but sox9 is included when refseq is omitted
ensembl.df <- getBM(attributes = c("ensembl_gene_id",
                                   # "refseq_mrna",
                                   "mgi_symbol", "description"),
                    filters = "ensembl_gene_id",
                    values = ensembl.id[1:12263,],
                    mart = ensembl, uniqueRows = TRUE)

ensembl.df[which(ensembl.df$ensembl_gene_id == "ENSMUSG00000000567"),]

##       ensembl_gene_id mgi_symbol
## 68 ENSMUSG00000000567       Sox9
##                                                    description
## 68 SRY-box containing gene 9 [Source:MGI Symbol;Acc:MGI:98371]


## the problem is not due to ensembl id #12263, because here sox9 is
## present
ensembl.df <- getBM(attributes = c("ensembl_gene_id", "refseq_mrna",
                                   "mgi_symbol", "description"),
                    filters = "ensembl_gene_id",
                    values = ensembl.id[c(1:382,1400:nrow(ensembl.id)),],
                    mart = ensembl, uniqueRows = TRUE)

ensembl.df[which(ensembl.df$ensembl_gene_id == "ENSMUSG00000000567"),]

##        ensembl_gene_id refseq_mrna mgi_symbol
## 141 ENSMUSG00000000567   NM_011448       Sox9
##                                                     description
## 141 SRY-box containing gene 9 [Source:MGI Symbol;Acc:MGI:98371]


## but with one more gene, sox9 is missing
ensembl.df <- getBM(attributes = c("ensembl_gene_id", "refseq_mrna",
                                   "mgi_symbol", "description"),
                    filters = "ensembl_gene_id",
                    values = ensembl.id[c(1:383,1400:nrow(ensembl.id)),],
                    mart = ensembl, uniqueRows = TRUE)

ensembl.df[which(ensembl.df$ensembl_gene_id == "ENSMUSG00000000567"),]
## [1] ensembl_gene_id refseq_mrna     mgi_symbol      description
## <0 rows> (or 0-length row.names)
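
Since the result appears to depend on how many IDs go into a single
query, one possible workaround is to send the IDs in smaller batches and
combine the results. A minimal sketch, assuming that smaller queries
come back complete (an assumption based on the runs above, not something
verified in general). query_in_chunks, toy_query and the chunk size of
500 are hypothetical choices, not part of biomaRt.

```r
## Hypothetical helper (not part of biomaRt): run query_fun on batches
## of at most chunk_size IDs and row-bind the resulting data frames.
query_in_chunks <- function(ids, query_fun, chunk_size = 500) {
  ids <- unique(as.character(ids))
  chunks <- split(ids, ceiling(seq_along(ids) / chunk_size))
  results <- lapply(chunks, query_fun)
  unique(do.call(rbind, results))
}

## Offline stand-in for the real query, so the sketch runs without a
## mart connection
toy_query <- function(ids) {
  data.frame(ensembl_gene_id = ids, stringsAsFactors = FALSE)
}
toy.df <- query_in_chunks(sprintf("GENE%05d", 1:1234), toy_query)
nrow(toy.df)

## With the real mart, query_fun would wrap the getBM() call, e.g.
## function(ids) getBM(attributes = c("ensembl_gene_id", "refseq_mrna",
##                                    "mgi_symbol", "description"),
##                     filters = "ensembl_gene_id", values = ids,
##                     mart = ensembl, uniqueRows = TRUE)
```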


sessionInfo()

## R version 3.0.1 (2013-05-16)
## Platform: x86_64-unknown-linux-gnu (64-bit)

## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
##  [7] LC_PAPER=C                 LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base

## other attached packages:
## [1] biomaRt_2.16.0

## loaded via a namespace (and not attached):
## [1] compiler_3.0.1 RCurl_1.95-4.1 tools_3.0.1    XML_3.98-1.1



"Davis, Wade"
<davisjwa at health.missouri.edu> writes:

> Georg,
>
> Using your code and calling for only "ENSMUSG00000000567" does not
> result in NA for me, as you can see:
>
>   library(biomaRt)
>   ensembl <- useMart("ensembl", dataset = 'mmusculus_gene_ensembl')
>   getBM(attributes = c("ensembl_gene_id", "refseq_mrna"),
>         filter = "ensembl_gene_id", "ENSMUSG00000000567",
>         mart = ensembl, uniqueRows = TRUE)
>
>        ensembl_gene_id refseq_mrna
>   1 ENSMUSG00000000567   NM_011448
>
> You are running R 3.0.1 just like me, but your biomaRt is 2.18 (I'm
> running 2.16, see below). biomaRt 2.18 is part of BioC 2.13, which is
> meant for R 3.0.2 as noted here: http://www.bioconductor.org/install/
>
> That is the most likely cause.
>
> Wade
>
>
> sessionInfo()
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] biomaRt_2.16.0
>
> loaded via a namespace (and not attached):
> [1] RCurl_1.95-4.1 XML_3.98-1.1
>
>


