[BioC] pulling functional information for SNPs
James W. MacDonald
jmacdon at med.umich.edu
Wed Apr 28 22:42:28 CEST 2010
Hi Kay,
Kay Jaja wrote:
> Hi ,
>
> I have a list of SNPs (rs numbers) and I am interested in pulling the functional data corresponding to each SNP from a database like Ensembl, i.e. the gene name if the SNP is in a gene, and whether it is in an intron or exon and is a non-synonymous or synonymous SNP.
> Is it possible to do this in R using biomaRt or any other package?
Do you mean to ask if it is possible, or is it easy? It is certainly
possible, although it depends on exactly what you want. Your question is
not as complete as it could be. In the future, you should try to explain
exactly what you are trying to do rather than asking open-ended questions.
You can get information about SNPs using biomaRt, but the available
information looks pretty sparse to me when compared to the small list of
interests you seem to have. But you can look to see what is available
easily enough:
library(biomaRt)
mart <- useMart("snp","hsapiens_snp")
listAttributes(mart)
There are one or two vignettes that come with biomaRt that should help
you get started if you like what you see.
I generally don't use biomaRt for this sort of thing, instead preferring
to hit the UCSC database directly. Note that what I show below might be
done as easily using the rtracklayer package; you might explore the
vignettes for that package as well. Anyway, I would use the RMySQL
package and query directly:
library(RMySQL)
con <- dbConnect(MySQL(), host = "genome-mysql.cse.ucsc.edu",
                 dbname = "hg18", user = "genome")
## what type of info is available?
> dbGetQuery(con, "select * from snp129 where name='rs25';")
  bin chrom chromStart chromEnd name score strand refNCBI refUCSC observed
1 673  chr7   11550666 11550667 rs25     0      -       T       T      A/G
  molType  class                             valid    avHet  avHetSE   func
1 genomic single by-cluster,by-frequency,by-hapmap 0.499586 0.014383 intron
  locType weight
1   exact      1
Note two things here. First, the rows are not guaranteed to come back in
the order you asked for, so you should always have the database return the
column you are querying on (this is true of biomaRt as well). Second, if
you are querying lots of SNPs, do it in one big query instead of one by
one. Repeatedly querying an online database will get you banned. So say
your rs IDs are in a vector rsid, and you want the chromosome, the
position, the bases, and the function (intron, exon, intragenic, etc.).
sql <- paste("select name, chrom, chromEnd, observed, func from snp129 ",
             "where name in ('", paste(rsid, collapse = "','"), "');", sep = "")
There are a lot of ' and " characters in there because we want something
that looks like this:
select name, chrom, chromEnd, observed, func from snp129 where name in
('rs25','rs26','rs27','rs28');
So make sure the SQL statement looks OK first. Then just do
dat <- dbGetQuery(con, sql)
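If you build this kind of query often, the same paste() construction can be wrapped in a small function. What follows is only a sketch: the name build_snp_query and the ID check are my own additions for illustration, not part of RMySQL, and the check assumes every ID is a plain rs number. Its only purpose is to keep stray quotes out of the SQL string.

```r
# Hypothetical helper (not part of RMySQL): build one "in (...)" query
# for a whole vector of rs IDs, so the database is hit only once.
build_snp_query <- function(rsid, table = "snp129") {
  rsid <- unique(as.character(rsid))
  # Assumption: a valid ID is "rs" followed by digits only.
  bad <- !grepl("^rs[0-9]+$", rsid)
  if (length(rsid) == 0 || any(bad)) {
    stop("expected rs IDs such as 'rs25'; got: ",
         paste(rsid[bad], collapse = ", "))
  }
  paste("select name, chrom, chromEnd, observed, func from ", table,
        " where name in ('", paste(rsid, collapse = "','"), "');",
        sep = "")
}

# Print the statement and check it by eye before sending it anywhere.
example_sql <- build_snp_query(c("rs25", "rs26", "rs27", "rs28"))
cat(example_sql, "\n")
```

The string it returns can be passed to dbGetQuery(con, ...) exactly as above.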
> rsid <- c("rs25","rs26","rs27","rs28")
> rsid
[1] "rs25" "rs26" "rs27" "rs28"
> sql <- paste("select name, chrom, chromEnd, observed, func from snp129 ",
+              "where name in ('", paste(rsid, collapse = "','"), "');", sep = "")
> sql
[1] "select name, chrom, chromEnd, observed, func from snp129 where name in ('rs25','rs26','rs27','rs28');"
> z <- dbGetQuery(con, sql)
> z
  name chrom chromEnd observed   func
1 rs25  chr7 11550667      A/G intron
2 rs26  chr7 11549996    -/A/G intron
3 rs27  chr7 11549750      C/G intron
4 rs28  chr7 11562590      A/G intron
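Because the return order is not guaranteed, it can help to line the result back up with the input vector afterwards. The sketch below uses base R's match() on a hand-typed, deliberately shuffled copy of the result above, so it runs without a database connection. The names res, ids, aligned and missing_ids are illustrative, and rs99999999 is only a placeholder ID assumed to be absent from the result.

```r
# Hand-typed copy of the query result above, shuffled on purpose.
res <- data.frame(
  name     = c("rs27", "rs25", "rs28", "rs26"),
  chrom    = "chr7",
  chromEnd = c(11549750, 11550667, 11562590, 11549996),
  observed = c("C/G", "A/G", "A/G", "-/A/G"),
  func     = "intron",
  stringsAsFactors = FALSE
)
# The last ID is a placeholder assumed not to be in the result.
ids <- c("rs25", "rs26", "rs27", "rs28", "rs99999999")

# For each queried ID, match() gives the first row of res holding it
# (NA if the ID did not come back at all).
idx <- match(ids, res$name)
aligned <- data.frame(query = ids, res[idx, -1], row.names = NULL,
                      stringsAsFactors = FALSE)
aligned

# IDs that returned nothing and may need a second look.
missing_ids <- ids[is.na(idx)]
missing_ids
```

Note that match() keeps only the first hit, so a SNP that maps to more than one location would lose its extra rows; merge(data.frame(name = ids), res, all.x = TRUE) would keep them, at the cost of re-sorting by name.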
Best,
Jim
>
> I appreciate your help,
> thanks
--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826