[BioC] biomaRt - batch query for chromosome location to gene identifier?
Wolfgang Huber
whuber at embl.de
Fri Nov 12 12:22:30 CET 2010
Kemal,
thanks for the feedback! Would you mind providing the
actual code that you tried? The line that you say you tried is
syntactically incorrect (or incomplete) in R, so it could never have
run, and the definition of the object 'ensembl54' is missing. Also,
please consider including the output of 'sessionInfo()'.
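
To illustrate the kind of self-contained example that is easy to debug, here is a minimal sketch, not a tested solution. It rebuilds the five example rows from your table and turns them into "chromosome:start:end:strand" strings, which is the form the Ensembl BioMart "chromosomal_region" filter is documented to accept (colons, not commas). The names 'clusters', 'make_regions' and 'query_regions' are made up for this illustration, and 'mart' stands for your 'ensembl54' object, whose definition I do not know.

```r
# Example rows copied from the table in the quoted message below.
clusters <- data.frame(
  id     = c("slc754_chr2", "slc4695_chr1", "slc2213_chr8",
             "slc1866_chr17", "slc1642_chr12"),
  strand = c("+", "-", "-", "-", "-"),
  chr    = c("chr2", "chr1", "chr8", "chr17", "chr12"),
  begin  = c(74295624L, 203949843L, 103464182L, 53272097L, 11150173L),
  end    = c(74295644L, 203949866L, 103464207L, 53272117L, 11150194L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one "chromosome:start:end:strand" string per row.
# Ensembl chromosome names carry no "chr" prefix; strand is coded 1 / -1.
make_regions <- function(d) {
  chrom  <- sub("^chr", "", d$chr)
  strand <- ifelse(d$strand == "+", 1L, -1L)
  sprintf("%s:%d:%d:%d", chrom, d$begin, d$end, strand)
}

regions <- make_regions(clusters)
print(regions)

# Hypothetical wrapper around the query. It is only defined here, not
# called, because it needs the biomaRt package, a network connection and
# a mart object (e.g. your 'ensembl54').
query_regions <- function(regions, mart) {
  biomaRt::getBM(attributes = "entrezgene",
                 filters    = "chromosomal_region",
                 values     = regions,   # a character vector, one region each
                 mart       = mart)
}

# Intended use, assuming 'ensembl54' has been created beforehand:
# geneid <- query_regions(regions, ensembl54)

# Always worth including in a report to the list.
sessionInfo()
```

With comma-separated strings, the individual numbers may well be read as separate filter values, which could explain the very large result. Note also that the result of such a query is not keyed by cluster; adding attributes such as "chromosome_name", "start_position" and "end_position" may help to match the genes back to the input locations.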
Best wishes
Wolfgang
Kemal Akat scripsit 12/11/10 04:01:
> Hi all,
>
> I have a list of sequence reads mapped to hg18, for which I have the exact chromosomal locations on NCBI build 36.
>
> Cluster ID     Strand  Chromosome  Cluster_Begin  Cluster_End
> slc754_chr2    +       chr2        74295624       74295644
> slc4695_chr1   -       chr1        203949843      203949866
> slc2213_chr8   -       chr8        103464182      103464207
> slc1866_chr17  -       chr17       53272097       53272117
> slc1642_chr12  -       chr12       11150173       11150194
> ...
>
> For the downstream analysis I would like to assign each location an identifier (Entrez Gene ID, Ensembl gene ID and so forth). The question is simply whether I can use the biomaRt package for this at all.
>
> It is easy for a single entry:
>
>> geneid<-getBM(attributes="entrezgene", filters=c("chromosome_name","start","end", "strand"), values = list(2,74295624, 74295644,1), mart=ensembl54) # ensembl54 is using the archived build 54 = NCBI 36
>> geneid
> entrezgene
> 1 10797
>>
>
> However, so far I have failed to make a batch query out of it.
>
> I imported/created the following one-column data frame, with each location formatted as required:
>
>> tdp
>                chromosomal_location
> slc754_chr2    2,74295624,74295644,1
> slc4695_chr1   1,203949843,203949866,-1
> slc2213_chr8   8,103464182,103464207,-1
> slc1866_chr17  17,53272097,53272117,-1
> slc1642_chr12  12,11150173,11150194,-1
> ...
>
> I have two points where I failed:
>
> 1) I have not found a single filter that replaces the multiple filters above. When I use "chromosomal_region" as a single filter and run:
>
>> geneid<-getBM(attributes="entrezgene", filters="chromosomal_region", values = list(tdp$chromosomal_location, mart=ensembl54)
>
> I get 19733 gene ids; my dataset actually has only 161 locations.
>
> 2) If I use multiple filters like I did above in the first example, "values" has to be a vector and the expression "values = list(tdp$chromosomal_location, mart=ensembl54)" yields a "subscript out of bounds" error.
> I tried splitting the location information into separate vectors, i.e. chromosome<- c(2,1,8,17,12,...), start<- c(74295624,...), end<- c(...), strand<- c(...), and modified my query:
>
>> geneid<-getBM(attributes="entrezgene", filters=c("chromosome_name","start","end", "strand"), values = list(chromosome, start, end, strand, mart=ensembl54)
>
> But this seems to combine the information in the different vectors, as the result has over 20,000 entries.
>
> Finally, I was thinking of a loop to complete the task, but this has been discouraged by another post in the mailing list archive!?
>
> Any help/idea appreciated!
>
> Thank you,
> Kemal
>
> Dr. med. Kemal Akat
> Postdoctoral Fellow
> Laboratory of RNA Molecular Biology
> The Rockefeller University
> 1230 York Avenue, Box #186
> New York, NY 10065
>
--
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber