[BioC] Problem with filtering genes in biomaRt
Pieta Schofield [guest]
guest at bioconductor.org
Tue Feb 4 18:45:16 CET 2014
Hullo
I am getting some inconsistent results with my attempts to filter genes by chromosome when retrieving them form "hsapiens_gene_ensembl"
if I specify a list of chromosomes to retrieve from I get a different number for some of the chromosomes than if i retrieve all the genes for all chromosomes or if I retrieve the genes from the chromosomes individually.
this is my full code of the problem
thanks in advance
#!/usr/bin/Rscript --vanilla
require(biomaRt)
.getGenes <- function(chrs=list())
{ # use biomart to get ranges for feature
hg19 <- useMart("ensembl",dataset="hsapiens_gene_ensembl")
if(length(chrs)>0)
{
getBM(attributes=c("chromosome_name",
"start_position",
"end_position",
"strand",
"ensembl_gene_id",
"external_gene_id",
"gene_biotype"),
filter="chromosome_name",
values=chrs,
mart=hg19)
}else{
getBM(attributes=c("chromosome_name",
"start_position",
"end_position",
"strand",
"ensembl_gene_id",
"external_gene_id",
"gene_biotype"),
mart=hg19)
}
}
unfiltered<-.getGenes()
chrs<-c(unlist(seq(1,22)),"X","Y","MT")
filtered<-.getGenes(chrs)
chronly<-lapply(chrs,
function(x){
length(.getGenes(x)$ensembl_gene_id)
})
names(chronly)<-chrs
allgenes<-table(unfiltered$chromosome_name)
missinggenes<-table(unfiltered$chromosome_name[which(
!(unfiltered$ensembl_gene_id
%in%
filtered$ensembl_gene_id))])
unlist(lapply(chrs,
function(x){
totalmissing<-ifelse(is.na(missinggenes[x]),0,missinggenes[x])
paste("Missing genes on chr",x,totalmissing,"of",
"individually",chronly[[x]],"all",allgenes[x])
}))
sessionInfo()
-- output of sessionInfo():
Loading required package: biomaRt
Loading required package: methods
[1] "Missing genes on chr 1 0 of individually 5321 all 5321"
[2] "Missing genes on chr 2 0 of individually 3990 all 3990"
[3] "Missing genes on chr 3 0 of individually 3043 all 3043"
[4] "Missing genes on chr 4 0 of individually 2521 all 2521"
[5] "Missing genes on chr 5 0 of individually 2856 all 2856"
[6] "Missing genes on chr 6 0 of individually 2906 all 2906"
[7] "Missing genes on chr 7 0 of individually 2818 all 2818"
[8] "Missing genes on chr 8 607 of individually 2385 all 2385"
[9] "Missing genes on chr 9 1892 of individually 2303 all 2303"
[10] "Missing genes on chr 10 0 of individually 2216 all 2216"
[11] "Missing genes on chr 11 0 of individually 3190 all 3190"
[12] "Missing genes on chr 12 0 of individually 2819 all 2819"
[13] "Missing genes on chr 13 0 of individually 1217 all 1217"
[14] "Missing genes on chr 14 0 of individually 2237 all 2237"
[15] "Missing genes on chr 15 0 of individually 2076 all 2076"
[16] "Missing genes on chr 16 0 of individually 2360 all 2360"
[17] "Missing genes on chr 17 0 of individually 2901 all 2901"
[18] "Missing genes on chr 18 0 of individually 1113 all 1113"
[19] "Missing genes on chr 19 0 of individually 2917 all 2917"
[20] "Missing genes on chr 20 0 of individually 1322 all 1322"
[21] "Missing genes on chr 21 0 of individually 720 all 720"
[22] "Missing genes on chr 22 0 of individually 1208 all 1208"
[23] "Missing genes on chr X 2028 of individually 2414 all 2414"
[24] "Missing genes on chr Y 506 of individually 506 all 506"
[25] "Missing genes on chr MT 37 of individually 37 all 37"
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] methods stats graphics grDevices utils datasets base
other attached packages:
[1] biomaRt_2.18.0
loaded via a namespace (and not attached):
[1] RCurl_1.95-4.1 XML_3.95-0.2
--
Sent via the guest posting facility at bioconductor.org.
More information about the Bioconductor
mailing list