[BioC] biomaRt - batch query for chromosome location to gene identifier?

Fri Nov 12 12:22:30 CET 2010

Kemal,

thanks for the feedback! Would you mind providing the actual code that
you ran? The line that you say you tried is syntactically incorrect (or
incomplete) in R, so it could never have run, and the definition of the
object 'ensembl54' is missing. Also, please include the output of
'sessionInfo()'.
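
In case it is useful while you put that together, here is a rough sketch of
how a batch query of this kind might be set up. It is not tested against
your data and rests on several assumptions: that the "chromosomal_region"
filter expects one colon-separated string of the form
chromosome:start:end:strand per region, that getBM() accepts a character
vector of such strings, and that the attribute names shown exist in
release 54. The helpers make_regions() and query_regions() are made up for
this illustration and are not part of biomaRt.

```r
# Example rows copied from the table in the quoted message (hg18 / NCBI36).
clusters <- data.frame(
  id     = c("slc754_chr2", "slc4695_chr1", "slc2213_chr8",
             "slc1866_chr17", "slc1642_chr12"),
  strand = c("+", "-", "-", "-", "-"),
  chr    = c("chr2", "chr1", "chr8", "chr17", "chr12"),
  start  = c(74295624L, 203949843L, 103464182L, 53272097L, 11150173L),
  end    = c(74295644L, 203949866L, 103464207L, 53272117L, 11150194L),
  stringsAsFactors = FALSE
)

# make_regions() is a hypothetical helper: it builds one
# "chromosome:start:end:strand" string per row, dropping the "chr"
# prefix and coding the strand as 1 or -1.
make_regions <- function(df) {
  paste(sub("^chr", "", df$chr),
        df$start,
        df$end,
        ifelse(df$strand == "+", 1L, -1L),
        sep = ":")
}

regions <- make_regions(clusters)

# query_regions() is likewise hypothetical. It assumes the mart offers the
# "chromosomal_region" filter and the attribute names listed below.
# It needs the biomaRt package and network access, so it is only defined
# here, not called.
query_regions <- function(regions, mart) {
  biomaRt::getBM(
    attributes = c("entrezgene", "chromosome_name",
                   "start_position", "end_position", "strand"),
    filters    = "chromosomal_region",
    values     = regions,
    mart       = mart
  )
}
```

With a mart object such as your 'ensembl54' (whose definition is the part
missing from your message), the call would be
query_regions(regions, ensembl54). The position attributes are requested so
that hits can be matched back to the input regions afterwards. If the filter
does work this way, comma-separated strings would be read as lists of
separate regions, which might explain the large result you saw, but that is
only a guess.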

	Best wishes
	Wolfgang

Kemal Akat wrote on 12/11/10 04:01:
> Hi all,
>
> I have a list of sequence reads mapped to hg18, for which I have the exact chromosomal locations on NCBI build 36.
>
> Cluster ID      Strand  Chromosome  Cluster_Begin  Cluster_End
> slc754_chr2     +       chr2        74295624       74295644
> slc4695_chr1    -       chr1        203949843      203949866
> slc2213_chr8    -       chr8        103464182      103464207
> slc1866_chr17   -       chr17       53272097       53272117
> slc1642_chr12   -       chr12       11150173       11150194
> ...
>
> For the downstream analysis I would like to assign each location an identifier (entrez gene id, ensembl gene id and so forth), and the question is simply if I can use the biomaRt package for this at all?
>
> It is easy for a single entry:
>
>> geneid<-getBM(attributes="entrezgene", filters=c("chromosome_name","start","end", "strand"), values = list(2,74295624, 74295644,1), mart=ensembl54) # ensembl54 is using the archived build 54 = NCBI 36
>> geneid
>   entrezgene
> 1      10797
>>
>
> However, so far I have failed to make a batch query out of it.
>
> I imported/created the following one-column data frame, with the locations formatted as required:
>
>> tdp
>                 chromosomal_location
> slc754_chr2     2,74295624,74295644,1
> slc4695_chr1    1,203949843,203949866,-1
> slc2213_chr8    8,103464182,103464207,-1
> slc1866_chr17   17,53272097,53272117,-1
> slc1642_chr12   12,11150173,11150194,-1
> ...
>
> I have two points where I failed:
>
> 1) I have not found a single filter that replaces the multiple filters above. When I use "chromosomal_region" as single filter and run:
>
>> geneid<-getBM(attributes="entrezgene", filters="chromosomal_region", values = list(tdp$chromosomal_location, mart=ensembl54)
>
> I get 19733 gene ids; my dataset actually has only 161 locations.
>
> 2) If I use multiple filters like I did above in the first example, "values" has to be a vector and the expression "values = list(tdp$chromosomal_location, mart=ensembl54)" yields a "subscript out of bounds" error.
> I also tried splitting the location information into separate vectors, i.e. chromosome<- c(2,1,8,17,12,...), start<- c(74295624,....), end<- c(...), strand<- c(...) and modified my query:
>
>> geneid<-getBM(attributes="entrezgene", filters=c("chromosome_name","start","end", "strand"), values = list(chromosome, start, end, strand, mart=ensembl54)
>
> But this seems to combine the information in the different vectors, as the result has more than 20,000 entries.
>
> Finally, I was thinking of a loop to complete the task, but this has been discouraged by another post in the mailing list archive!?
>
> Any help/idea appreciated!
>
> Thank you,
> Kemal
>
> Dr. med. Kemal Akat
> Postdoctoral Fellow
> Laboratory of RNA Molecular Biology
> The Rockefeller University
> 1230 York Avenue, Box #186
> New York, NY 10065

-- 

Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber