[BioC] more stupid *Ranges questions...
Steve Lianoglou
mailinglist.honeypot at gmail.com
Tue Sep 20 15:22:21 CEST 2011
Hi Tim,
One way I think you can speed up your code is to nuke the sapply's
since the alphabetFrequency and dinucleotideFrequency are "vectorized"
over views, code below:
2011/9/20 Tim Triche, Jr. <tim.triche at gmail.com>:
> You're awesome. I tried adapting some of your code to scan hg19 for
> expected (i.e., C*G) probabilities and observed CG probabilities. Let me
> know if I could do this more efficiently (i.e. seconds rather than minutes).
> I'm still pretty happy with it -- takes a few minutes on my laptop, but I
> bet some sort of not-very-secret BSapply() that I haven't discovered yet
> could make it fly.
>
> library(multicore)
> library(BSgenome.Hsapiens.UCSC.hg19)
> # library(SNPlocs.Hsapiens.dbSNP.20110815) # for picking off [CG] SNPs
> library(IlluminaHumanMethylation450kprobe) # will upload & release ASAP
>
> probe.oecg.by.chrom =
> mclapply(levels(IlluminaHumanMethylation450kprobe$CHR),
> function(i) {
> # a one-chromosome-per-core setup
> chrname = paste('chr', i, sep='')
> chr = Hsapiens[[chrname]]
> chrprobes = which(IlluminaHumanMethylation450kprobe$CHR == i)
> probecpgs = with(IlluminaHumanMethylation450kprobe[chrprobes,],
> IRanges(start=MAPINFO, end=MAPINFO, names=Probe_ID))
> cpgwindows = resize(probecpgs, fix="center", width=500)
> chr.seqs = Views(chr, cpgwindows)
> ocg = sapply(chr.seqs, function(v) {
> dinucleotideFrequency(v,as.prob=T)
> })['CG',]
> c.g = sapply(chr.seqs, function(v) {
> alphabetFrequency(v,as.prob=T,baseOnly=T)
> })
> ecg = c.g['C',] * c.g['G',]
> ocg/ecg
> })
I just took 15000 random intervals on chr10 as the IRanges to
construct cpgwindows and created `chr.seqs` from these. Consider:
R> system.time(ocg <- sapply(chr.seqs, function(v) {
dinucleotideFrequency(v, as.prob=TRUE)
})['CG',])
user system elapsed
18.583 0.053 19.13
vs.
R> system.time(ocg2 <- dinucleotideFrequency(chr.seqs, as.prob=TRUE)[,'CG'])
user system elapsed
0.033 0.000 0.033
R> all.equal(ocg, ocg2)
[1] TRUE
You can do the same trick with your c.g calc, eg. from this:
R> c.g = sapply(chr.seqs, function(v) {
alphabetFrequency(v,as.prob=T,baseOnly=T)
})
to this:
c.g <- alphabetFrequency(chr.seqs, as.prob=TRUE, baseOnly=TRUE)
and you should see a nice boost in speed.
HTH,
-steve
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
More information about the Bioconductor
mailing list