[Bioc-sig-seq] trimLRPatterns question

Harris A. Jaffee hj at jhu.edu
Sat Oct 30 14:38:51 CEST 2010


On Oct 30, 2010, at 12:46 AM, Kunbin Qu wrote:
> I tried that, it did not work (and I do not quite understand why  
> you suggest Rpattern="TGGT" or "TGAC"):
>
>> t1<-DNAString("AAAATGGTAC")
>> t2<-DNAString("TGGT")
>> t3<-trimLRPatterns(Rpattern=t2,subject=t1)
>> t3
>   10-letter "DNAString" instance
> seq: AAAATGGTAC
>> t3<-trimLRPatterns(Rpattern=t2,subject=t1, max.Rmismatch=rep(0,4))
>> t3
>   10-letter "DNAString" instance
> seq: AAAATGGTAC
>> t2<-DNAString("TGAC")
>> trimLRPatterns(Rpattern=t2,subject=t1, max.Rmismatch=rep(0,4))
>   10-letter "DNAString" instance
seq: AAAATGGTAC

I got the impression that you wanted your pattern to start with TG  
but wanted any
trimming confined to letters 7-10 of your t1 (especially leaving the  
TG at 5-6).
Either of "TGGT" or "TGAC" might suffice depending on the mismatch  
limits.

For example, TGAC becomes TG|GT|AC by inserting GT, and it also  
becomes AC after
deleting TG, so the following call determines that there is a  
permissible fuzzy
match for TGAC, ending at the end of the subject, and it trims 4  
letters:

 > trimLRPatterns(Rpattern="TGAC", subject="AAAATGGTAC",  
with.Rindels=T, max.Rmismatch=2)
[1] "AAAATG"

The supporting C code finds the AC match, but doesn't return that  
information.  We
currently don't use the heavier pairwiseAlignment machinery, which  
could return all
possible lengths of matches, especially the maximum (6: TGGTAC).  So  
trimLRPatterns
trims 4 letters as an approximation for the range [2, 6] of possible  
match lengths.

It would seem contrary to your wishes, but I have lobbied for using  
the maximum of
that range, so TGGTAC would get trimmed, or maybe as a compromise,  
half-way toward
that maximum, so trim GGTAC.  I have experimented with a new option  
of this sort:

	0 <= "R.extra.trim" <= max.Rmismatch

which can be implemented by post-processing (after using ranges=TRUE  
internally).

We should talk privately if I still don't understand your goal.

> If you mean to allow trimming a pattern like "TG.." in letters 7-10
> or less of your subject, even
> if TG occurs in letters 5-6, as in your t1, then you might want
> Rpattern="TGGT", and max.Rmismatch
> should have length 4 (or less, if you want to prevent trimming less
> than 4 letters).
>
> If I didn't get the point, a longer answer:
>
> Usually, I over-trimmed when I set the mismatch limits too high
> (especially with indels allowed),
> probably when I used the "rate" feature for the mismatch parameter,
> and that resulted in certain
> mismatch elements being larger than I wanted.  But you have zero
> mismatches allowed, and your whole
> Rpattern occurs exactly, as the final 6 letters of your subject.  So
> your example is a good example
> of what the function does (except that you don't need to allow
> indels, and you could get your t4 as
> the value of trimLRPatterns by using ranges=F, the default).  You can
> put -1 in the mismatch vector,
> for example
>
> 	M = some integer
> 	trimLRPatterns(Rpattern=t2, subject=t1, max.Rmismatch = c(rep(-1,5),
> M))
>
> This allows the 6-letter suffix of your subject to be trimmed if it
> matches your Rpattern with M
> or less errors, but it prevents trimming by anything shorter.
> Actually, your rep(0,4) is the same
> as c(rep(-1,2), rep(0,4)).  That is, trimming by 6 or 5 letters is
> allowed, but not by less.  Thus,
>
>> trimLRPatterns(Rpattern="TGGTAC", subject="AAAATGGTAC",
> max.Rmismatch=0)
> [1] "AAAA"
>
>> trimLRPatterns(Rpattern="TGGTAC", subject="AAAATGGTAC",
> max.Rmismatch=c(rep(-1,5), 0))
> [1] "AAAA"
>
> You can even put -1 as the 6th element of max.Rmismatch, but then you
> might as well start with
> Rpattern="TGGTA"; or -1 in positions 5 and 6, so, Rpattern="TGGT".
>
>
>
>
>
> ______________________________________________________________________
> The contents of this electronic message, including any attachments,  
> are intended only for the use of the individual or entity to which  
> they are addressed and may contain confidential information. If you  
> are not the intended recipient, you are hereby notified that any  
> use, dissemination, distribution, or copying of this message or any  
> attachment is strictly prohibited. If you have received this  
> transmission in error, please send an e-mail to  
> postmaster at genomichealth.com and delete this message, along with  
> any attachments, from your computer.



More information about the Bioc-sig-sequencing mailing list