[Bioc-sig-seq] Applying grep to a large number of tags. (looking for speed)

Ivan Gregoretti ivangreg at gmail.com
Fri Jul 23 18:03:05 CEST 2010


Indeed, its slightly faster: 103 seconds. countPDcit() was 117 seconds.

Excellent. Thank you again.

Ivan



On Fri, Jul 23, 2010 at 11:52 AM, Patrick Aboyoun <paboyoun at fhcrc.org> wrote:
> Ivan,
> Sorry for the evolving answer, but this may prove to be faster
>
> length(whichPDict(PDict(sread(A)), as(mySeq, "DNAString")))
>
>
> Patrick
>
>
> On 7/23/10 8:28 AM, Ivan Gregoretti wrote:
>>
>> It works. It produces and answer in under 2 minutes. I will flesh it
>> out a bit for posterity.
>>
>> Some slight modifications must be applied. First, mySeq cannot be of
>> class character. So, it works if I do
>>
>> countPDict(PDict(sread(A)), as(mySeq, "DNAString"))
>>
>> Now, that function outputs how many times a tag is found in mySeq. To
>> compute how many tags match mySeq once or more, I have to do
>>
>> sum(countPDict(PDict(sread(A)), as(mySeq, "DNAString"))!=0)
>>
>>
>> By the way, this could have been done with perl or python or any other
>> tools. However, it helps to learn how to do it efficiently from within
>> the Bioconductor.
>>
>> Thank you.
>>
>> Ivan
>>
>>
>>
>> On Fri, Jul 23, 2010 at 10:56 AM, Patrick Aboyoun<paboyoun at fhcrc.org>
>>  wrote:
>>
>>>
>>> Ivan,
>>> How about
>>>
>>> countPDict(PDict(sread(A)), mySeq)
>>>
>>>
>>> Patrick
>>>
>>>
>>> On 7/23/10 7:45 AM, Ivan Gregoretti wrote:
>>>
>>>>
>>>> Hello Patrick,
>>>>
>>>> The idea of vcountPattern is good but it does not quite work for two
>>>> reasons
>>>>
>>>> 1) mySeq is ~40kb. That is quite big and vcountPattern() throws the
>>>> error
>>>>
>>>>
>>>>
>>>>>
>>>>> vcountPattern(mySeq, sread(A))
>>>>>
>>>>>
>>>>
>>>> Error in .valid.algos(pattern, max.mismatch, min.mismatch, with.indels,
>>>>  :
>>>>   patterns with more than 20000 letters are not supported
>>>>
>>>> 2) vcountPattern is designed to find a motif (small) contained in a
>>>> genome (large), like this
>>>> vcountPattern("GCCACCAGGGGGCGC", Mmusculus)
>>>>
>>>> In my case, I have millions of motifs (the 36 bp tags) that I have to
>>>> find if they are contained in my single ~40kb. Its like a reverse
>>>> scenario. So, if I try reversing the arguments, I also get an error:
>>>>
>>>>
>>>>
>>>>>
>>>>> vcountPattern(sread(A), mySeq)
>>>>>
>>>>>
>>>>
>>>> Error in normargPattern(pattern, subject) :
>>>>   'pattern' must be a single string or an XString object
>>>>
>>>> Any more suggestions?
>>>>
>>>> Thank you,
>>>>
>>>> Ivan
>>>>
>>>>
>>>>
>>>>>
>>>>> sessionInfo()
>>>>>
>>>>>
>>>>
>>>> R version 2.12.0 Under development (unstable) (2010-03-25 r51410)
>>>> x86_64-unknown-linux-gnu
>>>>
>>>> locale:
>>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>> LC_TIME=en_US.UTF-8
>>>>  [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=C
>>>> LC_MESSAGES=en_US.UTF-8
>>>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>> LC_ADDRESS=C
>>>> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8
>>>> LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> other attached packages:
>>>> [1] annotate_1.27.1      AnnotationDbi_1.11.4 Biobase_2.9.0
>>>> ShortRead_1.7.9
>>>> [5] Rsamtools_1.1.8      lattice_0.18-8       Biostrings_2.17.24
>>>> GenomicRanges_1.1.17
>>>> [9] IRanges_1.7.12
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] DBI_0.2-5     grid_2.12.0   hwriter_1.2   RSQLite_0.9-1 xtable_1.5-6
>>>> an
>>>>
>>>>
>>>
>>>
>
>



More information about the Bioc-sig-sequencing mailing list