[Bioc-sig-seq] Applying grep to a large number of tags. (looking for speed)

Fri Jul 23 17:28:25 CEST 2010

It works. It produces and answer in under 2 minutes. I will flesh it
out a bit for posterity.

Some slight modifications must be applied. First, mySeq cannot be of
class character. So, it works if I do

countPDict(PDict(sread(A)), as(mySeq, "DNAString"))

Now, that function outputs how many times a tag is found in mySeq. To
compute how many tags match mySeq once or more, I have to do

sum(countPDict(PDict(sread(A)), as(mySeq, "DNAString"))!=0)

By the way, this could have been done with perl or python or any other
tools. However, it helps to learn how to do it efficiently from within
the Bioconductor.

Thank you.

Ivan

On Fri, Jul 23, 2010 at 10:56 AM, Patrick Aboyoun <paboyoun at fhcrc.org> wrote:
> Ivan,
> How about
>
> countPDict(PDict(sread(A)), mySeq)
>
>
> Patrick
>
>
> On 7/23/10 7:45 AM, Ivan Gregoretti wrote:
>>
>> Hello Patrick,
>>
>> The idea of vcountPattern is good but it does not quite work for two
>> reasons
>>
>> 1) mySeq is ~40kb. That is quite big and vcountPattern() throws the error
>>
>>
>>>
>>> vcountPattern(mySeq, sread(A))
>>>
>>
>> Error in .valid.algos(pattern, max.mismatch, min.mismatch, with.indels,  :
>>   patterns with more than 20000 letters are not supported
>>
>> 2) vcountPattern is designed to find a motif (small) contained in a
>> genome (large), like this
>> vcountPattern("GCCACCAGGGGGCGC", Mmusculus)
>>
>> In my case, I have millions of motifs (the 36 bp tags) that I have to
>> find if they are contained in my single ~40kb. Its like a reverse
>> scenario. So, if I try reversing the arguments, I also get an error:
>>
>>
>>>
>>> vcountPattern(sread(A), mySeq)
>>>
>>
>> Error in normargPattern(pattern, subject) :
>>   'pattern' must be a single string or an XString object
>>
>> Any more suggestions?
>>
>> Thank you,
>>
>> Ivan
>>
>>
>>>
>>> sessionInfo()
>>>
>>
>> R version 2.12.0 Under development (unstable) (2010-03-25 r51410)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>> LC_TIME=en_US.UTF-8
>>  [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=C
>> LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>> LC_ADDRESS=C
>> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8
>> LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] annotate_1.27.1      AnnotationDbi_1.11.4 Biobase_2.9.0
>> ShortRead_1.7.9
>> [5] Rsamtools_1.1.8      lattice_0.18-8       Biostrings_2.17.24
>> GenomicRanges_1.1.17
>> [9] IRanges_1.7.12
>>
>> loaded via a namespace (and not attached):
>> [1] DBI_0.2-5     grid_2.12.0   hwriter_1.2   RSQLite_0.9-1 xtable_1.5-6
>> an
>>
>
>