[BioC] biomaRt - batch query for chromosome location to gene identifier?
kakat at mail.rockefeller.edu
Fri Nov 12 04:01:09 CET 2010
I have a list of sequence reads mapped to hg18, for which I have the exact chromosomal locations on NCBI build 36.
Cluster ID      Strand  Chromosome  Cluster_Begin  Cluster_End
slc754_chr2     +       chr2        74295624       74295644
slc4695_chr1    -       chr1        203949843      203949866
slc2213_chr8    -       chr8        103464182      103464207
slc1866_chr17   -       chr17       53272097       53272117
slc1642_chr12   -       chr12       11150173       11150194
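A minimal sketch of how the five example rows could be read into R. The object name clusters, the column name Cluster_ID (the header above contains a space) and the derived columns chromosome_name and strand_num are illustrative choices, not part of the original data. It also assumes that Ensembl wants chromosome names without the "chr" prefix and the strand coded as 1 or -1.

```r
# Sketch: the example rows as an in-memory table.
txt <- "Cluster_ID Strand Chromosome Cluster_Begin Cluster_End
slc754_chr2 + chr2 74295624 74295644
slc4695_chr1 - chr1 203949843 203949866
slc2213_chr8 - chr8 103464182 103464207
slc1866_chr17 - chr17 53272097 53272117
slc1642_chr12 - chr12 11150173 11150194"

# Fix the column types explicitly so "+" and "-" stay as text
# and the coordinates are read as integers.
clusters <- read.table(text = txt, header = TRUE,
                       colClasses = c("character", "character", "character",
                                      "integer", "integer"))

# Assumption: Ensembl-style chromosome names ("2", not "chr2")
# and numeric strand (1 for "+", -1 for "-").
clusters$chromosome_name <- sub("^chr", "", clusters$Chromosome)
clusters$strand_num <- ifelse(clusters$Strand == "+", 1L, -1L)
```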
For the downstream analysis I would like to assign each location an identifier (Entrez gene ID, Ensembl gene ID and so forth). The question is simply whether I can use the biomaRt package for this at all.
It is easy for a single entry:
> geneid <-getBM(attributes="entrezgene", filters=c("chromosome_name","start","end", "strand"), values = list(2,74295624, 74295644,1), mart=ensembl54) # ensembl54 is using the archived build 54 = NCBI 36
However, so far I have failed to make a batch query out of it.
I imported/created a one-column data frame (tdp$chromosomal_location) with the locations formatted as required.
I have two points where I failed:
1) I have not found a single filter that replaces the multiple filters above. When I use "chromosomal_region" as single filter and run:
> geneid <-getBM(attributes="entrezgene", filters="chromosomal_region", values = list(tdp$chromosomal_location), mart=ensembl54)
I get 19733 gene ids; my dataset actually has only 161 locations.
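One way to phrase the batch query is to give the "chromosomal_region" filter one string per location. The sketch below rests on an assumption that nothing in this post confirms: that the filter accepts values of the form chromosome:start:end:strand, with the chromosome written without the "chr" prefix and the strand as 1 or -1. make_regions is a hypothetical helper, and ensembl54 is the Mart object from the single-entry query above. The query itself is skipped when biomaRt or ensembl54 is not available.

```r
# Hypothetical helper: build one "chr:start:end:strand" string per location.
make_regions <- function(chromosome, start, end, strand) {
  chr <- sub("^chr", "", chromosome)            # "chr2" -> "2"
  strand_num <- ifelse(strand == "+", 1L, -1L)  # "+" -> 1, "-" -> -1
  # as.integer() avoids scientific notation such as "1e+05" in the string
  paste(chr, as.integer(start), as.integer(end), strand_num, sep = ":")
}

regions <- make_regions(
  chromosome = c("chr2", "chr1"),
  start      = c(74295624, 203949843),
  end        = c(74295644, 203949866),
  strand     = c("+", "-")
)

# Only run the query if biomaRt and the Mart object exist in this session.
if (requireNamespace("biomaRt", quietly = TRUE) && exists("ensembl54")) {
  geneid <- biomaRt::getBM(
    # Coordinates are requested as well (attribute names assumed for this
    # archive) so that each gene can be traced back to a region.
    attributes = c("entrezgene", "chromosome_name",
                   "start_position", "end_position", "strand"),
    filters    = "chromosomal_region",
    values     = regions,
    mart       = ensembl54
  )
}
```

A batch result does not say which input region each gene came from, which is why the coordinates are requested alongside the identifier.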
2) If I use multiple filters as in the first example, "values" has to supply one vector per filter, and the expression "values = list(tdp$chromosomal_location), mart=ensembl54" yields a "subscript out of bounds" error.
I tried splitting the location information into separate vectors, i.e. chromosome <- c(2,1,8,17,12,...), start <- c(74295624,....), end <- c(...), strand <- c(...), and modified my query:
> geneid <-getBM(attributes="entrezgene", filters=c("chromosome_name","start","end", "strand"), values = list(chromosome, start, end, strand), mart=ensembl54)
But this seems to combine the information in the different vectors, as the result has over 20,000 entries.
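If a batch query returns gene coordinates but not the input location each gene belongs to, the pairing can in principle be restored locally by an overlap test. A minimal sketch, assuming a result table with the columns chromosome_name, start_position, end_position and strand. The genes data frame is invented toy data with made-up identifiers and coordinates, not real annotation, and overlapping_genes is a hypothetical helper.

```r
# Two of the example locations, with Ensembl-style chromosome and strand.
clusters <- data.frame(
  id              = c("slc754_chr2", "slc4695_chr1"),
  chromosome_name = c("2", "1"),
  start           = c(74295624L, 203949843L),
  end             = c(74295644L, 203949866L),
  strand          = c(1L, -1L),
  stringsAsFactors = FALSE
)

# Toy stand-in for a getBM() result; identifiers and coordinates are made up.
genes <- data.frame(
  gene_id         = c("geneA", "geneB", "geneC"),
  chromosome_name = c("2", "1", "1"),
  start_position  = c(74290000L, 203940000L, 100L),
  end_position    = c(74300000L, 203960000L, 200L),
  strand          = c(1L, -1L, -1L),
  stringsAsFactors = FALSE
)

# For each cluster, keep genes on the same chromosome and strand
# whose interval overlaps the cluster interval.
overlapping_genes <- function(clusters, genes) {
  hits <- lapply(seq_len(nrow(clusters)), function(i) {
    keep <- genes$chromosome_name == clusters$chromosome_name[i] &
      genes$strand == clusters$strand[i] &
      genes$start_position <= clusters$end[i] &
      genes$end_position >= clusters$start[i]
    genes$gene_id[keep]
  })
  names(hits) <- clusters$id
  hits
}

annotation <- overlapping_genes(clusters, genes)
```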
Finally, I was thinking of a loop to complete the task, but this has been discouraged in another post in the mailing list archive!?
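For comparison, a loop over the locations could look like the sketch below. It sends one request per location, so it is likely slower than a single batch request, but each result stays tied to its input row. annotate_each and query_ensembl are hypothetical names. query_ensembl simply repeats the single-entry call shown earlier and assumes that biomaRt is installed and that the Mart object ensembl54 exists.

```r
# Two of the example locations, with Ensembl-style chromosome and strand.
clusters <- data.frame(
  id              = c("slc754_chr2", "slc4695_chr1"),
  chromosome_name = c("2", "1"),
  start           = c(74295624L, 203949843L),
  end             = c(74295644L, 203949866L),
  strand          = c(1L, -1L),
  stringsAsFactors = FALSE
)

# Run query_one() once per row and label each result with its cluster id.
# query_one must return a data frame (possibly with zero rows).
annotate_each <- function(clusters, query_one) {
  results <- lapply(seq_len(nrow(clusters)), function(i) {
    res <- query_one(clusters$chromosome_name[i], clusters$start[i],
                     clusters$end[i], clusters$strand[i])
    if (nrow(res) == 0) return(NULL)  # no gene found at this location
    data.frame(cluster_id = clusters$id[i], res, stringsAsFactors = FALSE)
  })
  # Drop locations without hits before stacking the per-row results.
  do.call(rbind, Filter(Negate(is.null), results))
}

# Same call as the single-entry example; only works with biomaRt
# installed and an existing ensembl54 Mart object.
query_ensembl <- function(chr, start, end, strand) {
  biomaRt::getBM(attributes = "entrezgene",
                 filters = c("chromosome_name", "start", "end", "strand"),
                 values = list(chr, start, end, strand),
                 mart = ensembl54)
}

# geneid <- annotate_each(clusters, query_ensembl)
```

Passing the query as a function keeps the loop itself independent of the server, so it can be tried with a stand-in function first.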
Any help/idea appreciated!
Dr. med. Kemal Akat
Laboratory of RNA Molecular Biology
The Rockefeller University
1230 York Avenue, Box #186
New York, NY 10065