[Bioc-sig-seq] Applying grep to a large number of tags. (looking for speed)

Patrick Aboyoun paboyoun at fhcrc.org
Fri Jul 23 16:06:55 CEST 2010


Ivan,
Have you tried Biostrings::vmatchPattern?


Patrick


On 7/23/10 7:02 AM, Ivan Gregoretti wrote:
> Hello listers,
>
> How do you count how many of your Illumina tags match a particular sequence?
>
>
>
> Well, actually, I know at least two ways of doing this but they are
> both slow. This is perhaps the fastest one:
>
> library(ShortRead)
>
> # my sequence of interest (in my real case it is several kbs)
> mySeq<- 'AGTGAAGAAGCAAAGAGGCCTCTGCAAGTCACTCAAGAAGTCTACGATTTACATAGCTGACATA'
>
> # load my tags keeping only the ones that pass the filter
> A<- readAligned("/my/path", "s_5_export.txt.gz", "SolexaExport",
> filter=compose(alignDataFilter(expression(filtering=="Y"))))
>
> # keep only the tags that do not contain Ns
> A<- clean(A)
>
> # count how many tags match exactly some portion of mySeq
> sum_A<- sum(unlist(sapply(sread(A), function(x)grep(x, mySeq))))
>
>
> This has been running since yesterday and a similar strategy using
> srdistance has been running for 4 days. Can anybody suggest a more
> time-efficient calculation?
>
> Thank you,
>
> Ivan
>
>
>
> Ivan Gregoretti, PhD
> National Institute of Diabetes and Digestive and Kidney Diseases
> National Institutes of Health
> 5 Memorial Dr, Building 5, Room 205.
> Bethesda, MD 20892. USA.
> Phone: 1-301-496-1016 and 1-301-496-1592
> Fax: 1-301-496-9878
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>



More information about the Bioc-sig-sequencing mailing list