[BioC] what is the best way to get scores for matches from matchPWM() ?
Lucas Carey
lucas.carey at gmail.com
Wed Jan 20 22:28:54 CET 2010
Hi Patrick,
Thanks. Your method, with starting.at = start(hits) brings the function (in a small test case) from 6sec down to 1sec.
-Lucas
On Jan 20, 2010, at 1:56 PM, Patrick Aboyoun wrote:
> Lucas,
> Your e-mail was very timely since I just had my hands in the matchPWM code in BioC 2.6.
>
> First if you are using BioC 2.5/R-2.10, try modifying your first sapply loop to become an lapply loop to see if there is any performance gain:
>
> mmf <- lapply(1:Nchr, function(chr) {
> subject <- genome[[chr]]
> hits <- matchPWM(pwm, subject, min.score = cutoff)
> scores <- PWMscoreStartingAt(pwm, subject, starting.at = start(hits))
> list(hits = as(hits, "IRanges"), scores = scores)
> })
>
>
> If you are more adventurous and are using BioC 2.6/R-devel, I have just updated the matchPWM method for BSgenome objects in the BSgenome package to include the PWM score in the output so the code is much simpler. This updated BSgenome package in BioC 2.6 will be available from bioconductor.org on Thursday or available from svn now.
>
> > suppressMessages(library(BSgenome))
> > library(BSgenome.Celegans.UCSC.ce2)
> > data(HNF4alpha)
> > pwm <- PWM(HNF4alpha)
> > matchPWM(pwm, Celegans)
> RangedData with 154706 rows and 3 value columns across 7 spaces
> space ranges | strand score string
> <character> <IRanges> | <Rle> <numeric> <DNAStringSet>
> 1 chrI [ 3596, 3608] | + 0.8017528 GGGGCAAAATTAG
> 2 chrI [ 3880, 3892] | + 0.8291304 GGATTAGAGTTCT
> 3 chrI [ 5158, 5170] | + 0.8208024 ATTTCAAAGTTTT
> 4 chrI [ 6128, 6140] | + 0.8540097 GGTGCAAACGTCT
> 5 chrI [10039, 10051] | + 0.8078732 GGTCTAAAATTCT
> 6 chrI [10201, 10213] | + 0.8549414 AGACGAGAGGTCA
> 7 chrI [11388, 11400] | + 0.8137701 GGCACATAGGTCA
> 8 chrI [12708, 12720] | + 0.8624401 GGGGTAAAGTCAA
> 9 chrI [13190, 13202] | + 0.8361537 AGTGTAAAGATCT
> 10 chrI [15609, 15621] | + 0.8061359 GGAGAAAAGGCAA
> ...
> <154696 more rows>
> > sessionInfo()
> R version 2.11.0 Under development (unstable) (2010-01-18 r50995)
> i386-apple-darwin9.8.0
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] BSgenome.Celegans.UCSC.ce2_1.3.16 BSgenome_1.15.4 [3] Biostrings_2.15.18 IRanges_1.5.29
> loaded via a namespace (and not attached):
> [1] Biobase_2.7.3 tools_2.11.0
>
>
>
>
>
> Lucas Carey wrote:
>> Hi All,
>> I'm wondering what is the best way to get the score for every match
>> from matchPWM() in Biostrings
>>
>> Right now, to score all matches to pwm in genome I do this:
>>
>> #Find PWM hits for fwd & reverse complement of PWM for all chromosomes in genome
>> mmf <- sapply(1:Nchr,
>> function(chr){matchPWM(pwm,genome[[chr]],min.score=cutoff) } )
>> mmr <- sapply(1:Nchr,
>> function(chr){matchPWM(reverseComplement(pwm),genome[[chr]],min.score=cutoff)
>> } )
>> mmm <- c(mmf,mmr)
>>
>> #Extract the sequences. RevComp where necessary.
>> Sequences <- c( rapply(mmf,as.character,how='unlist'),
>> sapply(rapply(mmr,as.character,how='unlist'),function(x){c2s(rev(comp(s2c(x))))})
>> )
>>
>> #convert to DNAStringSet for in order to score. This is quite slow
>> lcl_set <- DNAStringSet(as.character(Sequences))
>> Scores <- sapply(lcl_set,PWMscoreStartingAt,pwm=pwm)
>>
>> This is incredibly inefficient. What is the best way to do this?
>>
>> thanks
>>
>> -Lucas
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
More information about the Bioconductor
mailing list