[BioC] biomaRt- incorrect number of transcripts

Ivanek, Robert robert.ivanek at fmi.ch
Fri Oct 30 10:42:01 CET 2009


Dear mailing list,

I have recently observed discrepancies in the genome annotation obtained
via the R package biomaRt.
I wanted to download all Ensembl transcripts for the entire mouse
genome (chromosomes 1:19, X, Y and MT only).

When I set the filter based on chromosome names, I retrieved ~36000
transcripts (please see the code below).
However, using the web service www.biomart.org, I received ~48000
transcripts for the same genome version and chromosomes.

Comparing these two data frames shows that the discrepancies in the
number of transcripts occur only for some chromosomes (3:9 and X).
When I specified only two chromosome names (2 and 3), the number of
downloaded transcripts was correct for both of them.
When I did not set any filter in the getBM function and did the filtering
manually in R, the number of transcripts was also correct.

Session info is attached.

Best Regards
Robert

--
Robert Ivanek
Postdoctoral Fellow Schuebeler Group
Friedrich Miescher Institute
Maulbeerstrasse 66
4058 Basel / Switzerland
Office phone: +41 61 697 6100


R> library("biomaRt")
R> ensembl <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
R> chroms <- c(1:19,"X","Y","MT")
R> table(getBM(attributes = c("ensembl_transcript_id", "chromosome_name",
                              "strand", "transcript_start", "transcript_end"),
               filters = "chromosome_name", values = chroms,
               mart = ensembl)$chromosome_name)

   1   10   11   12   13   14   15   16   17   18   19    2    3    4
2507 1869 4364 1501 1630 1624 1404 1522 1865  985 1245 5232 1080 1454
   5    6    7    8    9   MT    X    Y
 845 1209 1487 1129 1031   41 2072   17

R> ens.web <- read.delim("../../../mart_export.txt", stringsAsFactors = FALSE)
R> ens.web <- ens.web[ens.web$Chromosome.Name %in% chroms,]
R> table(ens.web$Chromosome.Name)

   1   10   11   12   13   14   15   16   17   18   19    2    3    4
2507 1869 4364 1501 1630 1624 1404 1522 1865  985 1245 5232 2179 3997
   5    6    7    8    9   MT    X    Y
2822 2524 3919 2021 2163   41 3297   17

R> table(getBM(attributes = c("ensembl_transcript_id", "chromosome_name",
                              "strand", "transcript_start", "transcript_end"),
               filters = "chromosome_name", values = c("2", "3", "MT"),
               mart = ensembl)$chromosome_name)

   2    3   MT 
5232 2179   41 
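
Since the small query above returns the expected counts, one possible
workaround is to query the chromosomes one at a time and stack the results.
The following is only a sketch: fetch_by_chromosome and fake_fetch are
hypothetical names, and it is an assumption (not something this post shows)
that single-chromosome queries are unaffected by the problem.

```r
# Hypothetical helper: run one query per chromosome and stack the results.
# `fetch_one` is any function that takes a single chromosome name and
# returns a data frame with the same columns for every chromosome.
fetch_by_chromosome <- function(chroms, fetch_one) {
  pieces <- lapply(chroms, fetch_one)
  do.call(rbind, pieces)
}

# Offline stand-in for a real query, so the sketch runs without a network.
fake_fetch <- function(chr) {
  data.frame(ensembl_transcript_id = paste0("T", chr, "_", 1:2),
             chromosome_name = chr,
             stringsAsFactors = FALSE)
}

combined <- fetch_by_chromosome(c("2", "3", "MT"), fake_fetch)
print(table(combined$chromosome_name))
```

With biomaRt, fetch_one could wrap the same getBM call used in this post,
passing the single chromosome name as values and keeping
filters = "chromosome_name" and mart = ensembl unchanged.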


R> ens.r <- getBM(attributes = c("ensembl_transcript_id", "chromosome_name",
                                 "strand", "transcript_start", "transcript_end"),
                  mart = ensembl)
R> ens.r <- ens.r[ens.r$chromosome_name %in% chroms,]
R> table(ens.r$chromosome_name)

   1   10   11   12   13   14   15   16   17   18   19    2    3    4
2507 1869 4364 1501 1630 1624 1404 1522 1865  985 1245 5232 2179 3997
   5    6    7    8    9   MT    X    Y
2822 2524 3919 2021 2163   41 3297   17
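
The per-chromosome comparison behind the tables above can be sketched as a
small helper that lines up the counts from two result sets and reports the
difference. compare_counts and the toy vectors are hypothetical and only
stand in for chromosome_name columns such as ens.r$chromosome_name.

```r
# Hypothetical helper: compare per-chromosome row counts of two result sets.
# chr_a and chr_b are vectors of chromosome names, one entry per transcript.
compare_counts <- function(chr_a, chr_b, chroms) {
  # Using a factor with fixed levels keeps chromosomes with zero rows.
  a <- as.vector(table(factor(chr_a, levels = chroms)))
  b <- as.vector(table(factor(chr_b, levels = chroms)))
  data.frame(chromosome = chroms,
             a = a,
             b = b,
             missing_in_a = b - a,
             stringsAsFactors = FALSE)
}

# Toy input, not real Ensembl data.
chroms <- c(1:19, "X", "Y", "MT")
toy_filtered   <- c("2", "2", "3", "MT")
toy_unfiltered <- c("2", "2", "3", "3", "3", "MT")

res <- compare_counts(toy_filtered, toy_unfiltered, chroms)
# Show only the chromosomes where the two result sets disagree.
print(res[res$missing_in_a != 0, ])
```

Applied to the filtered getBM result and to ens.r, this would list exactly
the chromosomes whose counts differ between the two queries.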



R> sessionInfo()
R version 2.10.0 (2009-10-26) 
x86_64-unknown-linux-gnu 

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base


other attached packages:
[1] biomaRt_2.2.0

loaded via a namespace (and not attached):
[1] RCurl_1.3-0  tools_2.10.0 XML_2.6-0   


