[BioC] biomaRt - batch query for chromosome location to gene identifier?
Wolfgang Huber
whuber at embl.de
Sun Nov 14 20:48:42 CET 2010
Dear Kemal,
thank you for your explanation - I understand the 'conceptual' nature of
your post. Still, if someone who wants to help is to start from what you
have tried and improve on it or make suggestions, it is more efficient
if the starting points agree as much as possible.
I made some experiments with your query to the Ensembl BioMart (see
attached, try different choices for 'sel'), and, like you, I could not
get meaningful results out of it. 'getBM' returns a huge result data
frame whose origin or rationale I fail to understand. Maybe one of
the Ensembl / BioMart experts can say something?
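One way to narrow the problem down is to send one region per request, so that every returned row can be traced back to the region that produced it. The following is only a sketch under assumptions, not an explanation of the behaviour described above: it assumes the 'chromosomal_region' filter accepts strings of the form "chr:start:end:strand", and the helper names make_regions and query_by_region are hypothetical. The coordinates are the first three entries of the vectors quoted later in this thread.

```r
# Build one "chr:start:end:strand" string per region (the string format is
# an assumption about what the chromosomal_region filter expects).
make_regions <- function(chromosome, start, end, strand) {
  stopifnot(length(chromosome) == length(start),
            length(start) == length(end),
            length(end) == length(strand))
  # sprintf("%.0f") avoids scientific notation such as "1e+05"
  paste(as.character(chromosome),
        sprintf("%.0f", start),
        sprintf("%.0f", end),
        as.character(strand),
        sep = ":")
}

# Hypothetical helper: one getBM() call per region. It needs biomaRt and a
# network connection, so it is defined here but not called.
query_by_region <- function(regions, mart, attributes = "entrezgene") {
  res <- lapply(regions, function(r) {
    hit <- biomaRt::getBM(attributes = attributes,
                          filters = "chromosomal_region",
                          values = r, mart = mart)
    if (nrow(hit) == 0) return(NULL)  # nothing found for this region
    data.frame(region = r, hit, stringsAsFactors = FALSE)
  })
  do.call(rbind, res)
}

# First three regions from the vectors quoted in this thread
regions <- make_regions(c(2, 1, 8),
                        c(74295624, 203949843, 103464182),
                        c(74295644, 203949866, 103464207),
                        c(1, -1, -1))
print(regions)
```

With a mart object such as 'ensembl54' from the quoted code, query_by_region(regions, ensembl54) would issue the requests one at a time. This is slow for many regions, but it makes it easy to see which input gives which output.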
I agree with Vince & Martin, however, that the GenomicRanges-based
approach suggested by Vince is much more appropriate here, for many
reasons, including flexibility and speed.
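To make the overlap idea concrete, here is a small base-R illustration. It does not use GenomicRanges and is not Vince's code; it only mimics, on made-up coordinates and gene names, the kind of (query, subject) index pairs that an overlap search yields. The function name overlap_hits and both toy tables are assumptions for illustration.

```r
# Base-R illustration of interval overlap. It mimics the idea of an overlap
# search and returns a two-column matrix of (query, subject) row indices.
overlap_hits <- function(query, subject) {
  hits <- lapply(seq_len(nrow(query)), function(i) {
    # Two intervals overlap when each starts before the other ends;
    # here chromosome and strand must match as well.
    j <- which(subject$chromosome == query$chromosome[i] &
               subject$strand == query$strand[i] &
               subject$start <= query$end[i] &
               subject$end >= query$start[i])
    cbind(query = rep(i, length(j)), subject = j)
  })
  do.call(rbind, hits)
}

# Made-up coordinates and gene names, for illustration only
query <- data.frame(chromosome = c("2", "X"),
                    start = c(150, 900), end = c(170, 920),
                    strand = c(1, -1), stringsAsFactors = FALSE)
genes <- data.frame(name = c("geneA", "geneB", "geneC"),
                    chromosome = c("2", "2", "X"),
                    start = c(100, 160, 500), end = c(200, 300, 950),
                    strand = c(1, -1, -1), stringsAsFactors = FALSE)

hits <- overlap_hits(query, genes)
print(genes$name[hits[, "subject"]])
```

Because each hit carries the index of the query region, the result stays paired with its input, which is what the batch query needs. The loop here is quadratic and only meant to show the logic; a dedicated range package would be the choice for real data.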
Best wishes
Wolfgang
On Nov/13/10 1:19 AM, Kemal Akat wrote:
> Dear Vincent and Martin,
>
> thank you for your help and explanations. I will try your suggestions.
>
> Dear Wolfgang,
>
> sorry if the info I posted was incomplete. It was more a semantic explanation than a technical one. I realized the incorrect syntax, but that was just a typo (as I couldn't copy and paste back then). I'll try to be more precise in the future.
>
> For the sake of completeness here is the actual code I was running:
>
> 1) with one filter, referring to the column of the data frame
>
>> options(width = 800, max.print = 5E5)
>> library(biomaRt)
>> ensembl54<- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl", host = "may2009.archive.ensembl.org", path = "/biomart/martservice", archive = FALSE)
>> tdp<- read.delim("/Users/Kemal/Desktop/Projects/biomaRt/tdp.txt", row.names = 1)
>> genes<- getBM(attributes = "entrezgene", filters = c("chromosomal_region"), values = list(tdp$Chromosomal_Location), mart = ensembl54)
>> genes
> entrezgene
> 1 6964
> 2 651536
> 3 445347
> 4 3492
> 5 100133739
> 6 652494
> ...
> 19731 55657
> 19732 267002
> 19733 692312
>> sessionInfo()
> R version 2.12.0 (2010-10-15)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] C/en_US.UTF-8/C/C/C/C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] biomaRt_2.6.0
>
> loaded via a namespace (and not attached):
> [1] RCurl_1.4-3 XML_3.2-0
>>
>
> 2) with multiple filters and the location info split into 4 separate vectors
>
>> options(width = 800, max.print = 5E5) # change display settings to allow larger data frames, modify as needed
>> library(biomaRt) # load the biomaRt package
>> #ensembl<- useMart("ensembl", dataset = "hsapiens_gene_ensembl") # to assign the variable ensembl (or else) to hg19, NCBI build 37
>> ensembl54<- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl", host = "may2009.archive.ensembl.org", path = "/biomart/martservice", archive = FALSE) # to use the hg18, NCBI build 36
>>
>> chromosome<-c(2, 1, 8, 17, 12, 10, 21, 2, 7, 16, 5, 4, 1, 19, 13, 13, 10, 7, 15, 7, 2, 12, 21, 20, 5, 11, 15, 12, 17, 3, 17, 14, 19, 13, 6, 14, 11, 13, 2, 20, 7, 10, 1, 16, "X", 22, 20, 1, 3, 8, 4, 1, 6, 15, 17, 4, 12, 7, 1, 14, 12, 17, 12, 6, 9, 22, "X", 7, 12, 10, 19, 5, 1, 8, 11, 8, 19, 7, 6, 5, 1, 6, 9, 19, 10, "X", 3, 5, 13, 17, 20, 3, 16, 5, 13, 12, 15, 19, 4, 16, 10, 8, 7, 7, 12, 6, 11, 21, 17, "X", 15, 10, 16, 15, 9, 5, 2, 6, 12, 5, 14, 14, 6, 6, 15, 4, 1, 9, 8, 1, 5, "X", 11, 2, 1, 19, 2, 2, 13, 1, 17, 13, "X", 13, 7, 11, 3, "X", 15, 17, 22, 11, 16, 19, 7, 2, 13, 9, 14, 12, 1)
>>
>> strand<- c(1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, -1, 1, -1, -1, -1, 1, -1, -1, -1, 1, 1, -1, 1, -1, -1, 1, 1, -1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1, 1, 1, 1, -1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 1, -1, -1, -1, -1, 1, -1, 1, -1, 1, -1, 1, -1, -1, 1, -1, 1, -1, -1, 1, -1, 1, 1, -1, -1, 1, -1, -1, 1, -1, 1, 1, 1, 1, 1, -1, 1, -1, 1, 1, -1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, -1, 1, -1, -1, -1, -1, 1, 1, 1, -1, -1, 1, 1, -1, -1, -1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, -1, 1, -1, -1, -1, 1, -1, 1, 1, -1, 1, 1, -1)
>>
>> start<- c(74295624, 203949843, 103464182, 53272097, 11150173, 52616450, 29638098, 160316091, 26218430, 73763529, 118611423, 78176046, 39566219, 45430361, 34823356, 114106490, 85902288, 55407132, 97080019, 97322520, 241940312, 49774225, 43149920, 1298551, 71540977, 2967925, 75125438, 107566872, 32012029, 50130463, 24612590, 103222346, 49628998, 76795050, 47700337, 61273382, 64288817, 99433818, 86857893, 60954640, 121562796, 102024852, 233341511, 46963400, 54115224, 17405158, 33706984, 233341179, 11574528, 117938795, 100202920, 191280320, 89641039, 91241813, 27712528, 72107017, 45043151, 115686015, 227526472, 59794222, 132193041, 2534782, 49500468, 64462329, 19040649, 34020658, 118261386, 5071002, 47616321, 101146860, 41419808, 145858601, 9912662, 109329554, 47446160, 130922997, 42311362, 135263153, 64348945, 32179718, 224044250, 112127969, 37126090, 41697757, 7442786, 24005946, 52264406, 154175486, 76478566, 20845822, 30384091, 9561760, 55844512, 65510903, 73435394, 52920938, 42797538, 45273560, 174489559, 55535115, 847112, 26283582, 79686310, 138640447, 751963, 124334398, 122435615, 29638867, 5277351, 46804967, 39560795, 89503999, 55844568, 42797166, 113346669, 139476285, 74236497, 43716256, 52725397, 137763505, 31647833, 101463626, 136622538, 76608878, 66287317, 174489713, 32145213, 114023332, 61632141, 160759702, 56575893, 73723822, 92851504, 74295579, 20108627, 51970080, 86384720, 174692895, 76479389, 233342144, 19819930, 47962168, 70434946, 33344305, 135285934, 64288869, 143115273, 100555421, 40493944, 44486723, 37016836, 27475187, 14437226, 40184189, 104928588, 38862574, 96917747, 114023291, 89944103, 67952950, 223656250)
>>
>> end<- c(74295644, 203949866, 103464207, 53272117, 11150194, 52616474, 29638121, 160316115, 26218451, 73763550, 118611442, 78176077, 39566244, 45430401, 34823379, 114106513, 85902314, 55407154, 97080040, 97322541, 241940335, 49774248, 43149944, 1298573, 71540997, 2967950, 75125461, 107566904, 32012052, 50130483, 24612615, 103222367, 49629021, 76795071, 47700356, 61273403, 64288842, 99433838, 86857915, 60954662, 121562817, 102024873, 233341535, 46963421, 54115248, 17405178, 33707009, 233341202, 11574549, 117938820, 100202946, 191280339, 89641061, 91241835, 27712550, 72107038, 45043175, 115686058, 227526491, 59794245, 132193064, 2534802, 49500489, 64462352, 19040675, 34020682, 118261409, 5071024, 47616347, 101146884, 41419832, 145858626, 9912686, 109329575, 47446181, 130923017, 42311383, 135263177, 64348968, 32179741, 224044275, 112127989, 37126110, 41697783, 7442807, 24005968, 52264426, 154175508, 76478587, 20845844, 30384113, 9561782, 55844532, 65510922, 73435415, 52920963, 42797563, 45273580, 174489628, 55535135, 847134, 26283606, 79686336, 138640468, 751985, 124334422, 122435636, 29638895, 5277374, 46804991, 39560819, 89504022, 55844595, 42797186, 113346688, 139476306, 74236521, 43716279, 52725418, 137763531, 31647855, 101463647, 136622560, 76608900, 66287361, 174489736, 32145234, 114023358, 61632162, 160759726, 56575914, 73723844, 92851527, 74295600, 20108650, 51970105, 86384744, 174692915, 76479410, 233342163, 19819951, 47962190, 70434970, 33344326, 135285963, 64288896, 143115294, 100555442, 40493968, 44486745, 37016858, 27475207, 14437254, 40184209, 104928610, 38862598, 96917795, 114023316, 89944124, 67952970, 223656273)
>>
>> genes<- getBM(attributes = "entrezgene", filters = c("chromosome_name", "start", "end", "strand"), values = list(chromosome, start, end, strand), mart = ensembl54)
>> genes
> entrezgene
> 1 6964
> 2 651536
> 3 445347
> 4 3492
> ...
> 19073 10806
> 19074 10000
> 19075 692312
>> sessionInfo()
> R version 2.12.0 (2010-10-15)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] C/en_US.UTF-8/C/C/C/C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] biomaRt_2.6.0
>
> loaded via a namespace (and not attached):
> [1] RCurl_1.4-3 XML_3.2-0
>>
>
> Kind regards,
> Kemal
>
>
> On 12.11.2010 at 09:16, Martin Morgan wrote:
>
>> On 11/12/2010 04:18 AM, Vincent Carey wrote:
>>> tx18 = transcripts(hg18.txdb)
>>>> kg = values(tx18[ findOverlaps(kem,tx18)@matchMatrix[,2] ])$tx_name
>>
>> Better to use the accessor matchMatrix(findOverlaps(kem, tx18))
>>
>> Martin
>>
>> --
>> Computational Biology
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>
>> Location: M1-B861
>> Telephone: 206 667-2793
>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: kemal.R
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20101114/5e745619/attachment.pl>