[BioC] Problem using biomaRt to retrieve human SNPs given a list of gene symbols

Yong Li mail.yong.li at googlemail.com
Thu May 10 15:02:54 CEST 2012

Dear all,

I have a task: given a list of several hundred human genes, retrieve
the SNPs located in these genes. Using biomaRt seems to be a good option.
I thought I would first get the chromosomal locations of the genes and
then find the SNPs in these regions. My code is as follows:

# start my R code
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
dbsnp <- useMart("snp", dataset = "hsapiens_snp")

# gene_symbols.txt is the file that has the list of gene symbols.
genes <- read.table("./gene_symbols.txt")
genes <- genes$V1
genes <- genes[1:50]

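# 'chromosome_name' must be requested here, because it is used below
# as locations$chromosome_name.
locations <- getBM(attributes = c('ensembl_gene_id', 'hgnc_symbol',
    'chromosome_name', 'start_position', 'end_position', 'strand'),
    filters = 'hgnc_symbol', values = genes, mart = ensembl)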

snps <- getBM(c('refsnp_id','allele','chrom_start','chrom_strand',
    'consequence_type_tv'), filters = c('chr_name',
    'chrom_start', 'chrom_end'), values = list(locations$chromosome_name,
    locations$start_position, locations$end_position), mart = dbsnp)
# end my R code

The getBM call that retrieves the locations is extremely fast, but the
call that retrieves the SNPs never finishes, even when I limit my gene
list to 50 genes. Does anyone have an idea of the reason for this? Or any
suggestions for solving this problem in another way or with other packages?
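
One thing that might be worth trying is to query one gene region at a
time. As far as I understand, when several chromosomes, starts and ends
are passed together, biomaRt does not pair them up per gene, so the
combined query may not describe the intended regions at all. The sketch
below is only a guess at a workaround, not a tested solution. It assumes
that `locations` has a `chromosome_name` column and that the SNP mart
accepts the 'chr_name', 'chrom_start' and 'chrom_end' filters used
above. The names `region_values`, `snps_by_gene` and `query_fun` are
hypothetical helpers, not part of biomaRt.

```r
# Build one (chromosome, start, end) value list per gene, so that the
# coordinates of different genes are never mixed within a single query.
region_values <- function(locations) {
  lapply(seq_len(nrow(locations)), function(i) {
    list(as.character(locations$chromosome_name[i]),
         locations$start_position[i],
         locations$end_position[i])
  })
}

# Run one query per gene region and return a list named by gene symbol.
# `query_fun` is a hypothetical argument: any function with the same
# interface as getBM. Pass biomaRt's getBM for real use.
snps_by_gene <- function(locations, mart, query_fun) {
  attrs <- c('refsnp_id', 'allele', 'chrom_start', 'chrom_strand',
             'consequence_type_tv')
  filts <- c('chr_name', 'chrom_start', 'chrom_end')
  res <- lapply(region_values(locations), function(v) {
    query_fun(attributes = attrs, filters = filts, values = v, mart = mart)
  })
  names(res) <- locations$hgnc_symbol
  res
}

# Usage (needs network access, so it is left commented out):
# library(biomaRt)
# locations <- getBM(attributes = c('hgnc_symbol', 'chromosome_name',
#     'start_position', 'end_position'), filters = 'hgnc_symbol',
#     values = genes, mart = ensembl)
# snps <- snps_by_gene(locations, dbsnp, getBM)
# all_snps <- do.call(rbind, snps)
```

Querying gene by gene means many small requests instead of one large
one, which is slower per gene but makes it easy to see which region, if
any, is the one that hangs.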

Thanks in advance!


PS: my sessioninfo.

> sessionInfo()
R version 2.14.2 (2012-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] biomaRt_2.10.0

loaded via a namespace (and not attached):
[1] RCurl_1.91-1 tools_2.14.2 XML_3.9-4
