[BioC] fast iterator over DNAString's?
Hervé Pagès
hpages at fhcrc.org
Thu Mar 11 19:51:18 CET 2010
One more thing. If you want exact matches, the sub() solution is
*much* faster. Here the main advantage of using a Biostring-based
solution like lefttrimFromFirstPatternStart() is if you want to
support mismatches/indels, which, AFAIK, can't easily be described
with a regular expression.
H.
Hervé Pagès wrote:
> Hi Paul,
>
> Paul Shannon wrote:
>> I wish to trim a variable length sequence from the end of many
>> thousands of DNAStrings in a DNAStringSet.
>> The sequence to be trimmed is any recognizable chunk of a solexa short
>> read adapter, which ends up on the end of, for example, 22nt miRNAs.
>> The adapter chunk might be found in the middle of a 35 base read, or
>> it might be closer to the end. In every case, I want to delete every
>> base from the start of the adapter chunk to the end of the read.
>>
>> I imagine there might be a BString operation equivalent to sed. See
>> could be used ike this:
>>
>> echo 'CGAAGCGGGATGATCTATCTCGTATGCCGTCTTCT' | sed
>> s/TCGTATGCCGTC.*$// --> GAAGCGGGATGATCTATC
>
> With standard string functions:
>
> x <- "CGAAGCGGGATGATCTATCTCGTATGCCGTCTTCT"
> pattern <- "TCGTATGCCGTC"
>
> > sub(paste(pattern, ".*$", sep=""), "", x)
> [1] "CGAAGCGGGATGATCTATC"
>
> Note that the 1st letter ("C") has disappeared in your example.
>
>>
>> (where TCGTATGCCGTC is only part of the 21-base adapter, but is
>> probably a long enough portion to be representative)
>>
>> Any way to do this with BStrings and friends?
>
> Here is the solution to Steve's exercise. As he suggested, the
> building blocks are vmatchPattern(), startIndex() and subseq<-():
>
> lefttrimFromFirstPatternStart <- function(pattern, subject)
> {
> mi <- vmatchPattern(pattern, subject)
> start_at <- sapply(startIndex(mi), function(start) start[1L])
> start_at[is.na(start_at)] <- width(subject)[is.na(start_at)] + 1L
> subseq(subject, start=start_at) <- ""
> subject
> }
>
> Note that it could be that the pattern has 0 or more than 1 hit in a
> given subject element. The above function tries to accommodate this.
>
> Then:
>
> x <- DNAStringSet(c(x, paste(rep.int("A", 35), collapse="")))
>
> > x
> A DNAStringSet instance of length 2
> width seq
> [1] 35 CGAAGCGGGATGATCTATCTCGTATGCCGTCTTCT
> [2] 35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>
> > lefttrimFromFirstPatternStart(pattern, x)
> A DNAStringSet instance of length 2
> width seq
> [1] 19 CGAAGCGGGATGATCTATC
> [2] 35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>
> Also note that vmatchPattern() lets the user have some control over
> mismatches/indels (see ?vmatchPattern) so you can add this level
> of control to lefttrimFromFirstPatternStart() too by adding
> the 'max.mismatch', 'min.mismatch', 'with.indels' and 'fixed'
> arguments to it (with the same default values as in vmatchPattern)
> and just passing their values to vmatchPattern().
>
> Cheers,
> H.
>
>
>>
>> Thanks!
>>
>> - Paul
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list