[BioC] pulling functional information for SNPs
James W. MacDonald
jmacdon at med.umich.edu
Tue May 4 20:56:41 CEST 2010
Hi Kay,
Please don't take things off-list. The list archives are intended to be
a useful repository of questions that were asked and answered, and if we
go off-list, that function is invalidated.
Kay Jaja wrote:
> Hi James,
>
> Thank you very much for the response, I have tried to use BiomaRt and
> about 700,000 SNPs on the Affy 6.0 chip to get the chromosome and the
> base pair position but the script is taking forever (more than 24 hours)
> , I have tried a test on a small set of SNPs and it works. below is my
> script
Getting a query for 700K things will likely take a long time. Had you
mentioned that you were using the Affy 6.0 chip, we could have gone in a
different direction.
biocLite("pd.genomewidesnp.6") ## this will take a while
library(pd.genomewidesnp.6)
con <- pd.genomewidesnp.6 at getdb()
## there might be a better way of setting up the connection...
## If so, Benilton will correct me very soon ;-P
## check things out
> dbListTables(con)
[1] "featureSet" "featureSetCNV" "pmfeature" "pmfeatureCNV"
[5] "sequence" "sequenceCNV" "sqlite_stat1" "table_info"
> dbListFields(con, "featureSet")
[1] "fsetid" "man_fsetid" "affy_snp_id"
"dbsnp_rs_id"
[5] "chrom" "physical_pos" "strand"
"cytoband"
[9] "allele_a" "allele_b" "gene_assoc"
"fragment_length"
[13] "fragment_length2" "dbsnp" "cnv"
## try a 70K query - I won't show how I made the snp vector...
> length(snps)
[1] 70000
> system.time(dbGetQuery(con, paste("select dbsnp_rs_id, chrom,
physical_pos from featureSet where dbsnp_rs_id in ('", paste(snps,
collapse = "','"), "');", sep = "")))
user system elapsed
4.89 1.09 119.09
So about 2 min for 70K query. Not bad.
> dbGetQuery(con, "select dbsnp_rs_id, chrom, physical_pos from
featureSet limit 10;")
dbsnp_rs_id chrom physical_pos
1 rs2887286 1 1145994
2 rs1496555 1 2224111
3 rs41477744 1 2319424
4 rs3890745 1 2543484
5 rs10492936 1 2926730
6 rs10489588 1 2941694
7 rs2376495 1 3084986
8 rs4648462 1 3155127
9 rs10492939 1 3292731
10 rs9424283 1 3695086
Best,
Jim
>
> ########################################################
> # snps is a vector of about 700,000 rs numbers from the Affy 6.0 #####
> ######################################################
> library(biomaRt)
> mart <- useMart("snp")
> mart<-useDataset("hsapiens_snp",mart)
> Checking attributes and filters ... ok
>
> out=getBM(attributes=("refsnp_id","chr_name","chrom_start"),
> + filters=c("refsnp_id"), values=snps, mart=mart)
>
> ############################################
>
> is there a reason why you prefer to use RMySQL over biomaRt?
> Do you know why my script is taking too long?
>
> Thanks for your help,
>
> ------------------------------------------------------------------------
> *From:* James W. MacDonald <jmacdon at med.umich.edu>
> *To:* Kay Jaja <kjaja27 at yahoo.com>
> *Cc:* bioconductor at stat.math.ethz.ch
> *Sent:* Wed, April 28, 2010 1:42:28 PM
> *Subject:* Re: [BioC] pulling functional information for SNPs
>
> Hi Kay,
>
> Kay Jaja wrote:
> > Hi ,
> >
> > I have a list of SNPS (rs numbers ) and I am interested in pulling
> the functional data corresponding to each SNP from a data base like
> ensemble, i.e.( is the gene name if the snp i sin a gene, intron, exon,
> non_ synonymous snp, or synonymous snp). is it possible to do this in R
> using BioMart or any other packages?
>
> Do you mean to ask if it is possible, or is it easy? It is certainly
> possible, although it depends on exactly what you want. Your question is
> not as complete as it could be. In the future, you should try to explain
> exactly what you are trying to do rather than asking open-ended questions.
>
> You can get information about SNPs using biomaRt, but the available
> information looks pretty sparse to me when compared to the small list of
> interests you seem to have. But you can look to see what is available
> easily enough:
>
> library(biomaRt)
> mart <- useMart("snp","hsapiens_snp")
> listAttributes(mart)
>
> There are one or two vignettes that come with biomaRt that should help
> you get started if you like what you see.
>
> I generally don't use biomaRt for this sort of thing, instead preferring
> to hit the UCSC database directly. Note that what I show below might be
> done as easily using the rtracklayer package; you might explore the
> vignettes for that package as well. Anyway, I would use the RMySQL
> package and query directly:
>
> library(RMySQL)
> con <- dbConnect("MySQL", host = "genome-mysql.cse.ucsc.edu
> <http://genome-mysql.cse.ucsc.edu/>", dbname = "hg18", user = "genome")
>
> ## what type of info is available?
>
> > dbGetQuery(con, "select * from snp129 where name='rs25';")
> bin chrom chromStart chromEnd name score strand refNCBI refUCSC observed
> 1 673 chr7 11550666 11550667 rs25 0 - T T A/G
> molType class valid avHet avHetSE func
> 1 genomic single by-cluster,by-frequency,by-hapmap 0.499586 0.014383 intron
> locType weight
> 1 exact 1
>
> Note two things here. First, you don't know the return order, so you
> should always ask for the database to return what you are querying on
> (this is true of biomaRt as well). Second, if you are querying lots of
> SNPs, just do it in one big query instead of one by one. Repeatedly
> querying an online database will get you banned. So say your rs IDs are
> in a vector rsid, and you want the chromosome, the position, the bases,
> and the function (intron, exon, intragenic, etc).
>
> sql <- paste("select name, chrom, chromEnd, observed, func from snp129
> where name in ('", paste(rsid, collapse = "','"), "');", sep = "")
>
> there are a lot of ' and " in there, because we want something that
> looks like this:
>
> select name, chrom, chromEnd, observed, func from snp129 where name in
> ('rs25','rs26','rs27','rs28');
>
> so you want to make sure the sql statement looks OK first. Then just do
>
> dat <- dbGetQuery(con, sql)
>
> > rsid <- c("rs25","rs26","rs27","rs28")
> > rsid
> [1] "rs25" "rs26" "rs27" "rs28"
> > sql <- paste("select name, chrom, chromEnd, observed, func from
> snp129 where name in ('", paste(rsid, collapse = "','"), "');", sep = "")
> > sql
> [1] "select name, chrom, chromEnd, observed, func from snp129 where name
> in ('rs25','rs26','rs27','rs28');"
> > z <- dbGetQuery(con, sql)
> > z
> name chrom chromEnd observed func
> 1 rs25 chr7 11550667 A/G intron
> 2 rs26 chr7 11549996 -/A/G intron
> 3 rs27 chr7 11549750 C/G intron
> 4 rs28 chr7 11562590 A/G intron
>
> Best,
>
> Jim
>
>
>
> >
> > I appreciate your help,
> > thanks
> >
> >
> > [[alternative HTML version deleted]]
> >
> >
> >
> > ------------------------------------------------------------------------
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch <mailto:Bioconductor at stat.math.ethz.ch>
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> -- James W. MacDonald, M.S.
> Biostatistician
> Douglas Lab
> University of Michigan
> Department of Human Genetics
> 5912 Buhl
> 1241 E. Catherine St.
> Ann Arbor MI 48109-5618
> 734-615-7826
> **********************************************************
> Electronic Mail is not secure, may not be read every day, and should not
> be used for urgent or sensitive issues
>
--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
**********************************************************
Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues
More information about the Bioconductor
mailing list