[Bioc-sig-seq] trimLRPatterns question

Sat Oct 30 04:14:35 CEST 2010

On Oct 29, 2010, at 6:17 PM, Kunbin Qu wrote:
> Is there a way to NOT trim a segment in the pattern from  
> trimLRPatterns? For example:
>
> t1<-DNAString("AAAATGGTAC")
> t2<-DNAString("TGGTAC")
> t3<-trimLRPatterns(Rpattern=t2, subject=t1, max.Rmismatch = rep 
> (0,4), ranges=T, with.Rindels=TRUE)
> t4<-DNAStringSet(t1, start=start(t3), end=end(t3))
>
>> t4
>   A DNAStringSet instance of length 1
>     width seq
> [1]     4 AAAA
>> t1
>   10-letter "DNAString" instance
> seq: AAAATGGTAC
>> t2
>   6-letter "DNAString" instance
> seq: TGGTAC
>>
>
> If I want to keep "TG" in t1, followed by the AAAA, which also  
> appears in the beginning of the pattern t2, is there a way? The rep 
> (0,4) does not help since it could find a perfect match at length 6  
> (the first try when starting to trim). I am trying to use "TG" as  
> my anchor points to avoid false over-trimming. Thanks.

If you mean to allow trimming a pattern like "TG.." in letters 7-10  
or less of your subject, even
if TG occurs in letters 5-6, as in your t1, then you might want  
Rpattern="TGGT", and max.Rmismatch
should have length 4 (or less, if you want to prevent trimming less  
than 4 letters).

If I didn't get the point, a longer answer:

Usually, I over-trimmed when I set the mismatch limits too high  
(especially with indels allowed),
probably when I used the "rate" feature for the mismatch parameter,  
and that resulted in certain
mismatch elements being larger than I wanted.  But you have zero  
mismatches allowed, and your whole
Rpattern occurs exactly, as the final 6 letters of your subject.  So  
your example is a good example
of what the function does (except that you don't need to allow  
indels, and you could get your t4 as
the value of trimLRPatterns by using ranges=F, the default).  You can  
put -1 in the mismatch vector,
for example

	M = some integer
	trimLRPatterns(Rpattern=t2, subject=t1, max.Rmismatch = c(rep(-1,5),  
M))

This allows the 6-letter suffix of your subject to be trimmed if it  
matches your Rpattern with M
or less errors, but it prevents trimming by anything shorter.   
Actually, your rep(0,4) is the same
as c(rep(-1,2), rep(0,4)).  That is, trimming by 6 or 5 letters is  
allowed, but not by less.  Thus,

 > trimLRPatterns(Rpattern="TGGTAC", subject="AAAATGGTAC",  
max.Rmismatch=0)
[1] "AAAA"

 > trimLRPatterns(Rpattern="TGGTAC", subject="AAAATGGTAC",  
max.Rmismatch=c(rep(-1,5), 0))
[1] "AAAA"

You can even put -1 as the 6th element of max.Rmismatch, but then you  
might as well start with
Rpattern="TGGTA"; or -1 in positions 5 and 6, so, Rpattern="TGGT".