[BioC] what is the best way to get scores for matches from matchPWM() ?

Wed Jan 20 22:28:54 CET 2010

Hi Patrick,
Thanks. Your method, with starting.at = start(hits) brings the function (in a small test case) from 6sec down to 1sec. 
-Lucas

On Jan 20, 2010, at 1:56 PM, Patrick Aboyoun wrote:

> Lucas,
> Your e-mail was very timely since I just had my hands in the matchPWM code in BioC 2.6.
> 
> First if you are using BioC 2.5/R-2.10, try modifying your first sapply loop to become an lapply loop to see if there is any performance gain:
> 
> mmf <- lapply(1:Nchr, function(chr) {
>   subject <- genome[[chr]]
>   hits <- matchPWM(pwm, subject, min.score = cutoff)
>   scores <- PWMscoreStartingAt(pwm, subject, starting.at = start(hits))
>   list(hits = as(hits, "IRanges"), scores = scores)
> })
> 
> 
> If you are more adventurous and are using BioC 2.6/R-devel, I have just updated the matchPWM method for BSgenome objects in the BSgenome package to include the PWM score in the output so the code is much simpler. This updated BSgenome package in BioC 2.6 will be available from bioconductor.org on Thursday or available from svn now.
> 
> > suppressMessages(library(BSgenome))
> > library(BSgenome.Celegans.UCSC.ce2)
> > data(HNF4alpha)
> > pwm <- PWM(HNF4alpha)
> > matchPWM(pwm, Celegans)
> RangedData with 154706 rows and 3 value columns across 7 spaces
>        space         ranges | strand     score         string
>  <character>      <IRanges> |  <Rle> <numeric> <DNAStringSet>
> 1         chrI [ 3596,  3608] |      + 0.8017528  GGGGCAAAATTAG
> 2         chrI [ 3880,  3892] |      + 0.8291304  GGATTAGAGTTCT
> 3         chrI [ 5158,  5170] |      + 0.8208024  ATTTCAAAGTTTT
> 4         chrI [ 6128,  6140] |      + 0.8540097  GGTGCAAACGTCT
> 5         chrI [10039, 10051] |      + 0.8078732  GGTCTAAAATTCT
> 6         chrI [10201, 10213] |      + 0.8549414  AGACGAGAGGTCA
> 7         chrI [11388, 11400] |      + 0.8137701  GGCACATAGGTCA
> 8         chrI [12708, 12720] |      + 0.8624401  GGGGTAAAGTCAA
> 9         chrI [13190, 13202] |      + 0.8361537  AGTGTAAAGATCT
> 10        chrI [15609, 15621] |      + 0.8061359  GGAGAAAAGGCAA
> ...
> <154696 more rows>
> > sessionInfo()
> R version 2.11.0 Under development (unstable) (2010-01-18 r50995)
> i386-apple-darwin9.8.0
> 
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base    
> other attached packages:
> [1] BSgenome.Celegans.UCSC.ce2_1.3.16 BSgenome_1.15.4                 [3] Biostrings_2.15.18                IRanges_1.5.29                  
> loaded via a namespace (and not attached):
> [1] Biobase_2.7.3 tools_2.11.0
> 
> 
> 
> 
> 
> Lucas Carey wrote:
>> Hi All,
>> I'm wondering what is the best way to get the score for every match
>> from matchPWM() in Biostrings
>> 
>> Right now, to score all matches to pwm in genome I do this:
>> 
>> #Find PWM hits for fwd & reverse complement of PWM for all chromosomes in genome
>> mmf <- sapply(1:Nchr,
>> function(chr){matchPWM(pwm,genome[[chr]],min.score=cutoff) }  )
>> mmr <- sapply(1:Nchr,
>> function(chr){matchPWM(reverseComplement(pwm),genome[[chr]],min.score=cutoff)
>> }  )
>> mmm <- c(mmf,mmr)
>> 
>> #Extract the sequences. RevComp where necessary.
>> Sequences <-  c( rapply(mmf,as.character,how='unlist'),
>> sapply(rapply(mmr,as.character,how='unlist'),function(x){c2s(rev(comp(s2c(x))))})
>> )
>> 
>> #convert to DNAStringSet for in order to score. This is quite slow
>> lcl_set  <- DNAStringSet(as.character(Sequences))
>> Scores  <- sapply(lcl_set,PWMscoreStartingAt,pwm=pwm)
>> 
>> This is incredibly inefficient. What is the best way to do this?
>> 
>> thanks
>> 
>> -Lucas
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>  
>