[BioC] biomaRt question -- not getting a gene

Elizabeth Purdom epurdom at stat.berkeley.edu
Mon Aug 18 23:11:53 CEST 2008

I am baffled by something I happened to discover in the results of my 
query with biomaRt and I can't figure out what's going on. I am using 
getBM to pull down a large number of gene coordinates, and filtering to 
restrict to chromosomes 1-22 and X,Y. For some reason this procedure 
(which is giving no errors) is not pulling down some genes that I think 
it should.

My basic code for pulling down all of this information is:
tempAll <- getBM(c("ensembl_gene_id", "start_position", "end_position",
                   "strand", "chromosome_name", "biotype"),
                 filters = "chromosome_name",
                 values = c(1:22, "X", "Y"), mart = mart)

A particular gene, "ENSG00000011677", is found by 'getGene' (and other 
getBM queries with different filters, as I discuss below) but not in my 
main query:
 > getGene("ENSG00000011677","ensembl_gene_id",mart)
   ensembl_gene_id hgnc_symbol
1 ENSG00000011677      GABRA3
1 Gamma-aminobutyric acid receptor subunit alpha-3 precursor (GABA(A) 
receptor subunit alpha-3). [Source:Uniprot/SWISSPROT;Acc:P34903]
   chromosome_name band strand start_position end_position ensembl_gene_id
1               X  q28     -1      151086290    151370993 ENSG00000011677
 > tempAll[match("ENSG00000011677",tempAll$ensembl_gene_id),]
    ensembl_gene_id start_position end_position strand chromosome_name 
NA            <NA>             NA           NA     NA            <NA> 

Oddly, if I change my main query so that the chromosome_name filter is 
just "X", just c("X","Y"), just c(1,"X"), or one of a couple of other 
combinations I tried, then this gene correctly appears. It also appears 
if I filter on 'biotype' equal to 'protein_coding'. I won't show all of 
these results unless someone wants them, but I copied and pasted the 
code each time, so the filter was definitely the only thing changing.

When I looked, of the 21,021 genes on chr1-22,X,Y returned with the 
filter 'biotype' equal to 'protein_coding', only 16,236 were in my main 
query that filtered by chromosome ('tempAll' above). The ~5,000 missing 
ones are only on chr 5-9 and X,Y. I suspect there is some matching 
problem going on, but I don't know where (or whether it's my error).
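
The comparison described above can be sketched roughly as follows. This 
is only an illustration: the helper name missingByChromosome and the toy 
data frames are made up here and stand in for the two getBM results (the 
biotype-filtered one and the chromosome-filtered one).

```r
# Hypothetical helper (not part of biomaRt): count, per chromosome, the
# genes that are present in a reference result but absent from the query
# result under suspicion.
missingByChromosome <- function(reference, query) {
  isMissing <- !(reference$ensembl_gene_id %in% query$ensembl_gene_id)
  table(reference$chromosome_name[isMissing])
}

# Toy data standing in for the two getBM results (not real query output).
byBiotype <- data.frame(
  ensembl_gene_id = c("G1", "G2", "G3", "G4"),
  chromosome_name = c("1", "5", "X", "X"),
  stringsAsFactors = FALSE
)
byChromosome <- data.frame(
  ensembl_gene_id = c("G1", "G3"),
  chromosome_name = c("1", "X"),
  stringsAsFactors = FALSE
)

# Genes G2 (chr 5) and G4 (chr X) are in byBiotype but not byChromosome.
counts <- missingByChromosome(byBiotype, byChromosome)
print(counts)
```

Tabulating the missing IDs by chromosome_name in this way is how one 
would see that the gaps are confined to particular chromosomes.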

For now I'm just pulling it all down and filtering myself, but I would 
like to know what's going on here.
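
A minimal sketch of that workaround, assuming the full table has already 
been retrieved without the chromosome filter. The helper name 
keepMainChromosomes and the toy data frame are invented for illustration; 
in practice the input would be the unfiltered getBM result.

```r
# Hypothetical helper (not part of biomaRt): keep only rows whose
# chromosome_name is one of the wanted chromosomes. c(1:22, "X", "Y") is
# coerced to a character vector, so "1".."22", "X" and "Y" are matched.
keepMainChromosomes <- function(genes, chromosomes = c(1:22, "X", "Y")) {
  genes[genes$chromosome_name %in% chromosomes, , drop = FALSE]
}

# Toy data standing in for an unfiltered result (IDs are placeholders).
allGenes <- data.frame(
  ensembl_gene_id = c("G1", "G2", "G3", "G4", "G5"),
  chromosome_name = c("1", "X", "MT", "c6_COX", "22"),
  stringsAsFactors = FALSE
)

# Rows on "MT" and "c6_COX" are dropped; rows on 1, X and 22 are kept.
filtered <- keepMainChromosomes(allGenes)
print(filtered)
```

Because the subsetting happens locally with %in%, the result does not 
depend on how the server handles a long list of filter values.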

