[Bioc-sig-seq] Applying grep to a large number of tags. (looking for speed)

Cei Abreu-Goodger cei at ebi.ac.uk
Fri Jul 23 17:02:26 CEST 2010


Why not use bowtie? (I think someone suggested it earlier in reply to a
similar post.)
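
If an aligner feels like overkill for a ~40 kb reference, an exact-match
lookup can also be sketched in plain R. This is only an illustration, and it
rests on assumptions: all tags have the same length, only exact matches on the
forward strand are wanted, and the data below are toy stand-ins.
tags_in_reference() is a hypothetical helper, not a Biostrings or ShortRead
function. The idea is to enumerate every k-mer of the reference once and then
test each tag for membership.

```r
# Toy stand-ins (hypothetical data): a short "reference" in place of the
# ~40 kb mySeq, and a few 5-letter "tags" in place of the 36 bp reads.
reference <- "ACGTACGTTAGCCGATAGCTAGGCTA"
tags <- c("ACGTT", "GCCGA", "TTTTT", "AGGCT")

# Hypothetical helper: TRUE for each tag that occurs exactly in `reference`.
# Assumes all tags share one length and only the forward strand matters.
tags_in_reference <- function(tags, reference) {
  k <- unique(nchar(tags))
  stopifnot(length(k) == 1, nchar(reference) >= k)
  # Every window of width k along the reference, deduplicated.
  starts <- seq_len(nchar(reference) - k + 1)
  ref_kmers <- unique(substring(reference, starts, starts + k - 1))
  # %in% uses hashing internally, so one pass over the tags is enough.
  tags %in% ref_kmers
}

found <- tags_in_reference(tags, reference)
```

With the real data one would presumably pass as.character(sread(A)) and
as.character(mySeq). A 40 kb reference has at most about 40,000 distinct
36-mers, so the lookup table stays small even with millions of tags.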

Ivan Gregoretti wrote:
> Hello Patrick,
> 
> The idea of vcountPattern is good, but it does not quite work, for two
> reasons:
> 
> 1) mySeq is ~40 kb. That is too long for a pattern, and vcountPattern()
> throws this error:
> 
>> vcountPattern(mySeq, sread(A))
> Error in .valid.algos(pattern, max.mismatch, min.mismatch, with.indels,  :
>   patterns with more than 20000 letters are not supported
> 
> 2) vcountPattern is designed to find a (small) motif contained in a
> (large) genome, like this:
> vcountPattern("GCCACCAGGGGGCGC", Mmusculus)
> 
> In my case, I have millions of motifs (the 36 bp tags), and I need to
> find out whether each one is contained in my single ~40 kb sequence.
> It's the reverse scenario. So, if I try reversing the arguments, I
> also get an error:
> 
>> vcountPattern(sread(A), mySeq)
> Error in normargPattern(pattern, subject) :
>   'pattern' must be a single string or an XString object
> 
> Any more suggestions?
> 
> Thank you,
> 
> Ivan
> 
>> sessionInfo()
> R version 2.12.0 Under development (unstable) (2010-03-25 r51410)
> x86_64-unknown-linux-gnu
> 
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
>  [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C
> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] annotate_1.27.1      AnnotationDbi_1.11.4 Biobase_2.9.0        ShortRead_1.7.9
> [5] Rsamtools_1.1.8      lattice_0.18-8       Biostrings_2.17.24   GenomicRanges_1.1.17
> [9] IRanges_1.7.12
> 
> loaded via a namespace (and not attached):
> [1] DBI_0.2-5     grid_2.12.0   hwriter_1.2   RSQLite_0.9-1 xtable_1.5-6
> 
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing