[BioC] fast iterator over DNAString's?

Thu Mar 11 19:51:18 CET 2010

One more thing. If you want exact matches, the sub() solution is
*much* faster. Here the main advantage of using a Biostring-based
solution like lefttrimFromFirstPatternStart() is if you want to
support mismatches/indels, which, AFAIK, can't easily be described
with a regular expression.

H.

Hervé Pagès wrote:
> Hi Paul,
> 
> Paul Shannon wrote:
>> I wish to trim a variable length sequence from the end of many 
>> thousands of DNAStrings in a DNAStringSet. 
>> The sequence to be trimmed is any recognizable chunk of a solexa short 
>> read adapter, which ends up on the end of, for example, 22nt miRNAs.  
>> The adapter chunk might be found in the middle of a 35 base read, or 
>> it might be closer to the end.  In every case, I want to delete every 
>> base from the start of the adapter chunk to the end of the read.
>>
>> I imagine there might be a BString operation equivalent to sed.  See 
>> could be used ike this:
>>
>>   echo 'CGAAGCGGGATGATCTATCTCGTATGCCGTCTTCT' | sed 
>> s/TCGTATGCCGTC.*$//      --> GAAGCGGGATGATCTATC
> 
> With standard string functions:
> 
>   x <- "CGAAGCGGGATGATCTATCTCGTATGCCGTCTTCT"
>   pattern <- "TCGTATGCCGTC"
> 
>   > sub(paste(pattern, ".*$", sep=""), "", x)
>   [1] "CGAAGCGGGATGATCTATC"
> 
> Note that the 1st letter ("C") has disappeared in your example.
> 
>>
>> (where TCGTATGCCGTC is only part of the 21-base adapter, but is 
>> probably a long enough portion to be representative)
>>
>> Any way to do this with BStrings and friends?
> 
> Here is the solution to Steve's exercise. As he suggested, the
> building blocks are vmatchPattern(), startIndex() and subseq<-():
> 
>   lefttrimFromFirstPatternStart <- function(pattern, subject)
>   {
>     mi <- vmatchPattern(pattern, subject)
>     start_at <- sapply(startIndex(mi), function(start) start[1L])
>     start_at[is.na(start_at)] <- width(subject)[is.na(start_at)] + 1L
>     subseq(subject, start=start_at) <- ""
>     subject
>   }
> 
> Note that it could be that the pattern has 0 or more than 1 hit in a
> given subject element. The above function tries to accommodate this.
> 
> Then:
> 
>   x <- DNAStringSet(c(x, paste(rep.int("A", 35), collapse="")))
> 
>   > x
>     A DNAStringSet instance of length 2
>       width seq
>   [1]    35 CGAAGCGGGATGATCTATCTCGTATGCCGTCTTCT
>   [2]    35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
> 
>   > lefttrimFromFirstPatternStart(pattern, x)
>     A DNAStringSet instance of length 2
>       width seq
>   [1]    19 CGAAGCGGGATGATCTATC
>   [2]    35 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
> 
> Also note that vmatchPattern() lets the user have some control over
> mismatches/indels (see ?vmatchPattern) so you can add this level
> of control to lefttrimFromFirstPatternStart() too by adding
> the 'max.mismatch', 'min.mismatch', 'with.indels' and 'fixed'
> arguments to it (with the same default values as in vmatchPattern)
> and just passing their values to vmatchPattern().
> 
> Cheers,
> H.
> 
> 
>>
>> Thanks!
>>
>>  - Paul
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> 

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319