[BioC] biomaRt - batch query for chromosome location to gene identifier?

Fri Nov 12 04:01:09 CET 2010

Hi all,

I have a list of sequence reads mapped to hg18, for which I have the exact chromosomal locations on NCBI build 36.

Cluster ID      Strand  Chromosome  Cluster_Begin  Cluster_End
slc754_chr2     +       chr2        74295624       74295644
slc4695_chr1    -       chr1        203949843      203949866
slc2213_chr8    -       chr8        103464182      103464207
slc1866_chr17   -       chr17       53272097       53272117
slc1642_chr12   -       chr12       11150173       11150194
...

For the downstream analysis I would like to assign each location an identifier (Entrez Gene ID, Ensembl gene ID, and so forth). My question is simply whether I can use the biomaRt package for this at all.

It is easy for a single entry:

> geneid <-getBM(attributes="entrezgene", filters=c("chromosome_name","start","end", "strand"), values = list(2,74295624, 74295644,1), mart=ensembl54) # ensembl54 is using the archived build 54 = NCBI 36
> geneid
 entrezgene
1      10797
>

However, so far I have failed to make a batch query out of it.

I imported/created the following one-column data frame with the locations formatted as necessary:

> tdp
                chromosomal_location
slc754_chr2     2,74295624,74295644,1
slc4695_chr1    1,203949843,203949866,-1
slc2213_chr8    8,103464182,103464207,-1
slc1866_chr17   17,53272097,53272117,-1
slc1642_chr12   12,11150173,11150194,-1
...
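
A minimal sketch of how a data frame like `tdp` could be built from the cluster table, using base R only. The names `clusters` and `make_locations` are hypothetical, and only the five rows shown in this post are included. The separator is left as a parameter: the Ensembl BioMart web interface appears to write regions as chr:start:end:strand rather than comma-separated, but which form the "chromosomal_region" filter accepts is an assumption that is not confirmed here.

```r
# Hypothetical reconstruction of the cluster table shown earlier in this post.
clusters <- data.frame(
  cluster_id = c("slc754_chr2", "slc4695_chr1", "slc2213_chr8",
                 "slc1866_chr17", "slc1642_chr12"),
  strand     = c("+", "-", "-", "-", "-"),
  chromosome = c("chr2", "chr1", "chr8", "chr17", "chr12"),
  begin      = c(74295624L, 203949843L, 103464182L, 53272097L, 11150173L),
  end        = c(74295644L, 203949866L, 103464207L, 53272117L, 11150194L),
  stringsAsFactors = FALSE
)

# Build one "chromosome<sep>start<sep>end<sep>strand" string per row.
# `sep` is an assumption: "," matches the format used in this post,
# ":" may be what the region filter actually expects.
make_locations <- function(x, sep = ",") {
  chrom  <- sub("^chr", "", x$chromosome)      # "chr2" -> "2"
  strand <- ifelse(x$strand == "+", 1L, -1L)   # "+" -> 1, "-" -> -1
  paste(chrom, x$begin, x$end, strand, sep = sep)
}

tdp <- data.frame(
  chromosomal_location = make_locations(clusters),
  row.names = clusters$cluster_id,
  stringsAsFactors = FALSE
)
```

Calling `make_locations(clusters, sep = ":")` gives the colon-separated variant of the same strings.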

I failed at two points:

1) I have not found a single filter that replaces the multiple filters above. When I use "chromosomal_region" as a single filter and run:

> geneid <-getBM(attributes="entrezgene", filters="chromosomal_region", values = list(tdp$chromosomal_location), mart=ensembl54)

I get 19,733 gene IDs, although my dataset has only 161 locations.

2) If I use multiple filters as in the first example, "values" has to be a vector, and the expression "values = list(tdp$chromosomal_location)" yields a "subscript out of bounds" error.
I tried splitting the location information into separate vectors, i.e. chromosome <- c(2,1,8,17,12,...), start <- c(74295624,....), end <- c(...), strand <- c(...), and modified my query:

> geneid <-getBM(attributes="entrezgene", filters=c("chromosome_name","start","end", "strand"), values = list(chromosome, start, end, strand), mart=ensembl54)

But this seems to combine the information in the different vectors, as the result has over 20,000 entries.

Finally, I was thinking of using a loop to complete the task, but another post in the mailing list archive discouraged this.
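
For reference, the row-by-row approach could be sketched roughly as follows. `annotate_regions`, `lookup` and `stub_lookup` are hypothetical names. The commented-out `lookup` reuses the single-entry getBM call from earlier in this post and assumes that biomaRt is loaded and `ensembl54` exists; the stub is only an offline stand-in so the sketch runs without a server. Sending one request per region is slow, which is presumably why loops were discouraged.

```r
# Query each region separately so chromosome, start, end and strand stay paired.
# `lookup` is any function(chrom, start, end, strand) returning a vector of IDs.
annotate_regions <- function(regions, lookup) {
  hits <- lapply(seq_len(nrow(regions)), function(i) {
    ids <- lookup(regions$chromosome[i], regions$start[i],
                  regions$end[i], regions$strand[i])
    if (length(ids) == 0) ids <- NA   # keep regions without any hit
    data.frame(cluster_id = regions$cluster_id[i], entrezgene = ids,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, hits)
}

# With biomaRt, `lookup` could wrap the single-entry call from this post
# (assumes library(biomaRt) and an existing `ensembl54` mart object):
# lookup <- function(chrom, start, end, strand)
#   getBM(attributes = "entrezgene",
#         filters = c("chromosome_name", "start", "end", "strand"),
#         values = list(chrom, start, end, strand),
#         mart = ensembl54)$entrezgene

# Offline stand-in: returns the ID reported above for the chr2 region;
# the empty result for any other region is made up for illustration.
stub_lookup <- function(chrom, start, end, strand) {
  if (chrom == 2) 10797L else integer(0)
}

regions <- data.frame(
  cluster_id = c("slc754_chr2", "slc4695_chr1"),
  chromosome = c(2, 1),
  start      = c(74295624, 203949843),
  end        = c(74295644, 203949866),
  strand     = c(1, -1),
  stringsAsFactors = FALSE
)

result <- annotate_regions(regions, stub_lookup)
```

The result has one row per cluster and hit, so each identifier can be traced back to its cluster ID.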

Any help/idea appreciated!

Thank you,
Kemal

Dr. med. Kemal Akat
Postdoctoral Fellow
Laboratory of RNA Molecular Biology
The Rockefeller University
1230 York Avenue, Box #186
New York, NY 10065