[BioC] more stupid *Ranges questions...

Tue Sep 20 15:22:21 CEST 2011

Hi Tim,

One way I think you can speed up your code is to nuke the sapply's
since the alphabetFrequency and dinucleotideFrequency are "vectorized"
over views, code below:

2011/9/20 Tim Triche, Jr. <tim.triche at gmail.com>:
> You're awesome.  I tried adapting some of your code to scan hg19 for
> expected (i.e., C*G) probabilities and observed CG probabilities.  Let me
> know if I could do this more efficiently (i.e. seconds rather than minutes).
>  I'm still pretty happy with it -- takes a few minutes on my laptop, but I
> bet some sort of not-very-secret BSapply() that I haven't discovered yet
> could make it fly.
>
> library(multicore)
> library(BSgenome.Hsapiens.UCSC.hg19)
> # library(SNPlocs.Hsapiens.dbSNP.20110815) # for picking off [CG] SNPs
> library(IlluminaHumanMethylation450kprobe) # will upload & release ASAP
>
> probe.oecg.by.chrom =
> mclapply(levels(IlluminaHumanMethylation450kprobe$CHR),
> function(i) {
>  # a one-chromosome-per-core setup
>  chrname = paste('chr', i, sep='')
>  chr = Hsapiens[[chrname]]
>  chrprobes = which(IlluminaHumanMethylation450kprobe$CHR == i)
>  probecpgs = with(IlluminaHumanMethylation450kprobe[chrprobes,],
>                   IRanges(start=MAPINFO, end=MAPINFO, names=Probe_ID))
>  cpgwindows = resize(probecpgs, fix="center", width=500)
>  chr.seqs = Views(chr, cpgwindows)
>  ocg = sapply(chr.seqs, function(v) {
>    dinucleotideFrequency(v,as.prob=T)
>  })['CG',]
>  c.g = sapply(chr.seqs, function(v) {
>    alphabetFrequency(v,as.prob=T,baseOnly=T)
>  })
>  ecg = c.g['C',] * c.g['G',]
>  ocg/ecg
> })

I just took 15000 random intervals on chr10 as the IRanges to
construct cpgwindows and created `chr.seqs` from these. Consider:

R> system.time(ocg <- sapply(chr.seqs, function(v) {
   dinucleotideFrequency(v, as.prob=TRUE)
})['CG',])
   user  system elapsed
 18.583   0.053  19.13

vs.

R> system.time(ocg2 <- dinucleotideFrequency(chr.seqs, as.prob=TRUE)[,'CG'])
   user  system elapsed
  0.033   0.000   0.033

R> all.equal(ocg, ocg2)
[1] TRUE

You can do the same trick with your c.g calc, eg. from this:

R> c.g = sapply(chr.seqs, function(v) {
    alphabetFrequency(v,as.prob=T,baseOnly=T)
 })

to this:

c.g <- alphabetFrequency(chr.seqs, as.prob=TRUE, baseOnly=TRUE)

and you should see a nice boost in speed.

HTH,

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact