[BioC] Biostrings - vcountPattern optimization
Erik Wright
eswright at wisc.edu
Thu Jul 22 19:32:39 CEST 2010
Hi Patrick,
Thanks, this looks promising. Three possible complications are:
(1) The fragments are not all the same width. Is this possible with Pdict?
(2) I allow a variable number of mismatches based on each individual fragment's width.
(3) The fragments sometimes include ambiguity letters (IUPAC extended letters).
A more accurate example would be:
#### start ####
fragments <- DNAStringSet(c("ACS","NCCAGAA")) # no indels
sequence_set <- DNAStringSet(c("ATAGCATACKACCA","GATTACGTACCADADATTACA") # variable widths
for (i in 1:length(fragments)) {
counts <- vcountPattern(fragments[[i]],
sequence_set,
max.mismatch=floor(length(fragments[[i]])/5)) # variable mis-matches
hits <- length(which(counts > 0))
print(hits)
}
#### end ####
Do think it is possible to make this work Pdict for a speed improvement?
Thanks again!,
Erik
On Jul 22, 2010, at 12:11 PM, Patrick Aboyoun wrote:
> Erik,
> Have you tried vcountPDict? It will use an Aho - Corasick matching algorithm (http://en.wikipedia.org/wiki/Aho–Corasick_string_matching_algorithm) that is pretty fast, albeit memory intensive.
>
> library(Biostrings)
> fragments<- DNAStringSet(c("ACTG","AAAA"))
> sequence_set<- DNAStringSet(c("TAGACATGAC","ACCAAATGAC"))
> pdict<- PDict(fragments)
> counts<- vcountPDict(pdict, sequence_set)
>
>> counts
> [,1] [,2]
> [1,] 0 0
> [2,] 0 0
>
>> sessionInfo()
> R version 2.12.0 Under development (unstable) (2010-07-18 r52554)
> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] Biostrings_2.17.26 IRanges_1.7.13
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.9.0 tools_2.12.0
>
>
>
>
> Patrick
>
>
> On 7/22/10 8:54 AM, Erik Wright wrote:
>> Hello,
>>
>> Lately I have been working on counting sequence fragments in larger sets of sequences. I am searching for thousands of fragments of 30 to 130 bases in hundreds of thousands of sequences between 1200 and 1600 bases. Currently I am using the following method to count the number of "hits":
>>
>> #### start ####
>> library(Biostrings)
>> fragments<- DNAStringSet(c("ACTG","AAAA"))
>> sequence_set<- DNAStringSet(c("TAGACATGAC","ACCAAATGAC"))
>>
>> for (i in 1:length(fragments)) {
>> counts<- vcountPattern(fragments[[i]],
>> sequence_set,
>> max.mismatch=1)
>> hits<- length(which(counts> 0))
>> print(hits)
>> }
>> #### end ####
>>
>> This method is taking a long time to complete, so I am wondering if I am doing this in the most efficient manner? Does anyone have a suggestion for how I can accomplish the same task more efficiently?
>>
>> Thanks!,
>> Erik
>>
>>
>>
>>
>>
>>> sessionInfo()
>>>
>> R version 2.11.0 (2010-04-22)
>> x86_64-apple-darwin9.8.0
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] Biostrings_2.16.0 IRanges_1.6.0
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.8.0
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
More information about the Bioconductor
mailing list