[Bioc-sig-seq] a question about trimLRPatterns

Harris A. Jaffee hj at jhu.edu
Wed Aug 31 18:50:16 CEST 2011


Sending back to the list, since others may be confused also.

On Aug 31, 2011, at 11:48 AM, wang peter wrote:
> Dear Harris,
>            I am Shan. Thank you very much for your kind help,
> but I am still confused about how trimLRPatterns works.
> For example,
> if I set
> > subject = "TTTACGT"
> > Lpattern = "TTTAACGT"
> the result is :
> > trimLRPatterns(Lpattern = Lpattern, subject = subject,
> +                max.Lmismatch=1, with.Lindels=TRUE)
> [1] ""
>
> But if I set
> > subject = "TTTACGT"
> > Lpattern = "AAATTTAACGT"
> the result is :
> > trimLRPatterns(Lpattern = Lpattern, subject = subject,
> +                max.Lmismatch=1, with.Lindels=TRUE)
> [1] "TTTACGT"
> How can this be explained?

The explanation is that max.Lmismatch is a vector giving your mismatch
tolerance for each successive test of an Lpattern suffix against the
beginning of the subject.  The vector is expected to have length
nchar(Lpattern), with element max.Lmismatch[i] controlling the test for
the suffix of length i.  If a shorter vector is supplied, as you did
here (a vector of length 1), the function expands it to length
nchar(Lpattern) by filling with -1's at the *low end*; a -1 means the
corresponding suffix is not used for trimming.  In both of your cases,
your 1 therefore becomes the last element of the vector and applies
only to the longest suffix, i.e. the full Lpattern.  With
with.Lindels=TRUE, 1 edit is enough for "TTTAACGT" to match "TTTACGT",
but it is not enough for "AAATTTAACGT" to match the same subject.  For
that you would need 4 edits (deletions of A):

 > trimLRPatterns(Lpattern = Lpattern, subject = subject,
 +                max.Lmismatch=3, with.Lindels=TRUE)
[1] "TTTACGT"

 > trimLRPatterns(Lpattern = Lpattern, subject = subject,
 +                max.Lmismatch=4, with.Lindels=TRUE)
[1] ""
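
To make the padding rule concrete, here is a minimal base-R sketch of
how a short max.Lmismatch vector lines up with the pattern suffixes.
The helper pad_max_mismatch() is hypothetical, written only for
illustration; it is not a Biostrings function and is not how the
package implements the expansion internally.

```r
# Hypothetical helper (NOT part of Biostrings): pad a short tolerance
# vector with -1's at the low end so that it has length nchar(pattern).
pad_max_mismatch <- function(max_mismatch, pattern) {
  n <- nchar(pattern)
  k <- length(max_mismatch)
  stopifnot(k <= n)  # this sketch only covers the "shorter vector" case
  c(rep(-1L, n - k), as.integer(max_mismatch))
}

# Element i of the result is the tolerance for the suffix of length i.
pad_max_mismatch(1, "TTTAACGT")
# [1] -1 -1 -1 -1 -1 -1 -1  1

pad_max_mismatch(1, "AAATTTAACGT")
# [1] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1
```

In both cases the single 1 lands in the last slot, so only the full
Lpattern is tested, with a tolerance of 1 edit.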

On the other hand, you can trim the entire subject a different way,
allowing for only 1 edit, by employing the 4th longest suffix of
Lpattern, namely "TTTAACGT".  The commands below show that 1 edit is
not enough to trim the whole subject using the *3rd longest* Lpattern
suffix, namely "ATTTAACGT" (for which you would need 2 edits!):

 > trimLRPatterns(Lpattern = Lpattern, subject = subject,
 +                max.Lmismatch=rep(1,3), with.Lindels=TRUE)
[1] "TTTACGT"

 > trimLRPatterns(Lpattern = Lpattern, subject = subject,
 +                max.Lmismatch=rep(1,4), with.Lindels=TRUE)
[1] ""

# allows for 2 edits, for the 3 longest pattern suffixes:
 > trimLRPatterns(Lpattern = Lpattern, subject = subject,
 +                max.Lmismatch=rep(2,3), with.Lindels=TRUE)
[1] ""

# shows exactly where the 2 is needed (for the 3rd longest suffix):
 > trimLRPatterns(Lpattern = Lpattern, subject = subject,
 +                max.Lmismatch=c(2,0,0), with.Lindels=TRUE)
[1] ""
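
The edit counts quoted above can be checked without Biostrings.  The
sketch below uses base R's utils::adist() to compute the plain
Levenshtein distance between each of the four longest Lpattern
suffixes and the whole subject.  This is only an illustration: it is
an assumption that whole-string distance is the right stand-in here,
which is reasonable only because each of these suffixes is being asked
to trim the entire subject.  It is not the routine trimLRPatterns
calls internally.

```r
subject  <- "TTTACGT"
Lpattern <- "AAATTTAACGT"
n <- nchar(Lpattern)

# The four longest suffixes of Lpattern, longest first.
suffixes <- substring(Lpattern, 1:4, n)

# Levenshtein distance from each suffix to the whole subject.
edits <- drop(utils::adist(suffixes, subject))
names(edits) <- suffixes
edits
# AAATTTAACGT  AATTTAACGT   ATTTAACGT    TTTAACGT
#           4           3           2           1
```

So a tolerance of 1 only succeeds once the 4th longest suffix is
allowed, and the 3rd longest suffix needs a tolerance of 2.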

To see the R code for trimLRPatterns, do

 > showMethods("trimLRPatterns", includeDefs=TRUE)

and

 > Biostrings:::.XStringSet.trimLRPatterns

and (for Lpattern)

 > Biostrings:::.computeTrimStart

Also see  ?which.isMatchingStartingAt

> And do you know how to read the C source code of trimLRPatterns?

Start with the function XString_match_pattern_at() in

	Biostrings/src/lowlevel_matching.c

This is called by .matchPatternAt() in R/lowlevel-matching.R.

> thank u very much
> shan gao


