[BioC] what is the best way to get scores for matches from matchPWM() ?

Wed Jan 20 19:56:24 CET 2010

Lucas,
Your e-mail was very timely since I just had my hands in the matchPWM 
code in BioC 2.6.

First if you are using BioC 2.5/R-2.10, try modifying your first sapply 
loop to become an lapply loop to see if there is any performance gain:

mmf <- lapply(1:Nchr, function(chr) {
    subject <- genome[[chr]]
    hits <- matchPWM(pwm, subject, min.score = cutoff)
    scores <- PWMscoreStartingAt(pwm, subject, starting.at = start(hits))
    list(hits = as(hits, "IRanges"), scores = scores)
})

If you are more adventurous and are using BioC 2.6/R-devel, I have just 
updated the matchPWM method for BSgenome objects in the BSgenome package 
to include the PWM score in the output so the code is much simpler. This 
updated BSgenome package in BioC 2.6 will be available from 
bioconductor.org on Thursday or available from svn now.

 > suppressMessages(library(BSgenome))
 > library(BSgenome.Celegans.UCSC.ce2)
 > data(HNF4alpha)
 > pwm <- PWM(HNF4alpha)
 > matchPWM(pwm, Celegans)
RangedData with 154706 rows and 3 value columns across 7 spaces
         space         ranges | strand     score         string
   <character>      <IRanges> |  <Rle> <numeric> <DNAStringSet>
1         chrI [ 3596,  3608] |      + 0.8017528  GGGGCAAAATTAG
2         chrI [ 3880,  3892] |      + 0.8291304  GGATTAGAGTTCT
3         chrI [ 5158,  5170] |      + 0.8208024  ATTTCAAAGTTTT
4         chrI [ 6128,  6140] |      + 0.8540097  GGTGCAAACGTCT
5         chrI [10039, 10051] |      + 0.8078732  GGTCTAAAATTCT
6         chrI [10201, 10213] |      + 0.8549414  AGACGAGAGGTCA
7         chrI [11388, 11400] |      + 0.8137701  GGCACATAGGTCA
8         chrI [12708, 12720] |      + 0.8624401  GGGGTAAAGTCAA
9         chrI [13190, 13202] |      + 0.8361537  AGTGTAAAGATCT
10        chrI [15609, 15621] |      + 0.8061359  GGAGAAAAGGCAA
...
<154696 more rows>
 > sessionInfo()
R version 2.11.0 Under development (unstable) (2010-01-18 r50995)
i386-apple-darwin9.8.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] BSgenome.Celegans.UCSC.ce2_1.3.16 BSgenome_1.15.4                 
[3] Biostrings_2.15.18                IRanges_1.5.29                  

loaded via a namespace (and not attached):
[1] Biobase_2.7.3 tools_2.11.0

Lucas Carey wrote:
> Hi All,
> I'm wondering what is the best way to get the score for every match
> from matchPWM() in Biostrings
>
> Right now, to score all matches to pwm in genome I do this:
>
> #Find PWM hits for fwd & reverse complement of PWM for all chromosomes in genome
> mmf <- sapply(1:Nchr,
> function(chr){matchPWM(pwm,genome[[chr]],min.score=cutoff) }  )
> mmr <- sapply(1:Nchr,
> function(chr){matchPWM(reverseComplement(pwm),genome[[chr]],min.score=cutoff)
> }  )
> mmm <- c(mmf,mmr)
>
> #Extract the sequences. RevComp where necessary.
> Sequences <-  c( rapply(mmf,as.character,how='unlist'),
> sapply(rapply(mmr,as.character,how='unlist'),function(x){c2s(rev(comp(s2c(x))))})
> )
>
> #convert to DNAStringSet for in order to score. This is quite slow
> lcl_set  <- DNAStringSet(as.character(Sequences))
> Scores  <- sapply(lcl_set,PWMscoreStartingAt,pwm=pwm)
>
> This is incredibly inefficient. What is the best way to do this?
>
> thanks
>
> -Lucas
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>