[BioC] Biostrings - vcountPattern optimization

Thu Jul 22 19:11:28 CEST 2010

Erik,
Have you tried vcountPDict? It will use an Aho - Corasick matching 
algorithm 
(http://en.wikipedia.org/wiki/Aho–Corasick_string_matching_algorithm) 
that is pretty fast, albeit memory intensive.

library(Biostrings)
fragments<- DNAStringSet(c("ACTG","AAAA"))
sequence_set<- DNAStringSet(c("TAGACATGAC","ACCAAATGAC"))
pdict<- PDict(fragments)
counts<- vcountPDict(pdict, sequence_set)

>  counts
      [,1] [,2]
[1,]    0    0
[2,]    0    0

>  sessionInfo()
R version 2.12.0 Under development (unstable) (2010-07-18 r52554)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Biostrings_2.17.26 IRanges_1.7.13

loaded via a namespace (and not attached):
[1] Biobase_2.9.0 tools_2.12.0

Patrick

On 7/22/10 8:54 AM, Erik Wright wrote:
> Hello,
>
> Lately I have been working on counting sequence fragments in larger sets of sequences.  I am searching for thousands of fragments of 30 to 130 bases in hundreds of thousands of sequences between 1200 and 1600 bases.  Currently I am using the following method to count the number of "hits":
>
> #### start ####
> library(Biostrings)
> fragments<- DNAStringSet(c("ACTG","AAAA"))
> sequence_set<- DNAStringSet(c("TAGACATGAC","ACCAAATGAC"))
>
> for (i in 1:length(fragments)) {
> 	counts<- vcountPattern(fragments[[i]],
> 		sequence_set,
> 		max.mismatch=1)
> 	hits<- length(which(counts>  0))
> 	print(hits)
> }
> #### end ####
>
> This method is taking a long time to complete, so I am wondering if I am doing this in the most efficient manner?  Does anyone have a suggestion for how I can accomplish the same task more efficiently?
>
> Thanks!,
> Erik
>
>
>
>
>    
>> sessionInfo()
>>      
> R version 2.11.0 (2010-04-22)
> x86_64-apple-darwin9.8.0
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] Biostrings_2.16.0 IRanges_1.6.0
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.8.0
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>