[Bioc-sig-seq] a question about trimLRPatterns
Harris A. Jaffee
hj at jhu.edu
Wed Aug 31 22:44:59 CEST 2011
On Aug 31, 2011, at 3:20 PM, wang peter wrote:
dear Harris A. Jaffee,
> hj at jhu.edu
> thank u very much for your explaination.
> but i am still confused because of not knowing its algorithm
> could u tell me if i can read them from lowlevel_matching.c
> let me try to concluded what u have told me:
> 1. the algorithm will change every max.Rmismatch to a vector.
> by adding -1 if it is not 0.
> if max.Lmismatch=0 it will do : max.Lmismatch = rep(0, nchar
> if max.Lmismatch=c(2,2) it will get 2 2 -1 -1 -1
> how does the algorithm use -1 or 0 to control the alignment??
See ?trimLRPatterns. Yes, 0 is a special value, as is any scalar in
These are interpreted as "rates", relative to the length of the
of the pattern to be tested. (See ?agrep) For Lpattern, such a
is converted internally to
as.integer(s * 1:nchar(Lpattern))
The order of this vector of max.mismatch limits is counter-intuitive
wasn't my idea). A sanity check is that it should be monotone
not necessarily strictly, as with the above construction using the
The suffixes of the Lpattern are tested in the order longest first, with
the first match winning. [They originally were tested shortest
this explains the ordering of the vector, at least from the
perspective. But in this scenario, you cannot know when you are
you inevitably waste time.] As I said, the mismatch limit for the
the pattern suffix of length i is max.Lmismatch[i], hence the
the vector max.Lmismatch are used from the top down.
As I said, max.Lmismatch values shorter than nchar(Lpattern) are padded
with -1's at the low end. E.g., contrary to what you wrote above, c
will be expanded into -1, -1, .... 2, 2.
See ?isMatchingAt -- especially the Details section. 0 as a
limit means that only perfect matches are acceptable (which rules out
matches involving edits of indels); -1 as a max.mismatch limit
any match, since any edit distance is non-negative.
> 2. the algorithm will try any suffix segment on Lmismatch by
> substr(Lpattern, i, nchar(Lpattern)), i = 1:nchar(Lpattern)
> e.g. Lpattern = "TTTAACGT"
> it will try "T","GT",......
Yes, but in the opposite order: "TTTAACGT", "TTAACGT", .... "GT", "T".
> 3. during each try, if the length of segment is equal to the
> lenght of
> subject, the algoritm will stop, except it can allow indel.
I'm not sure what you mean with the stopping.
> 4. and indel should be counted as a mismatch
If indels are permitted and used, each letter added or removed is
as 1 toward the edit distance. See the examples on ?isMatchingAt .
> subject = "TTTACGT"
> Lpattern = "TTTAACGT" # pattern has an extra 'A'
> # need to allow for 1 "edit", to remove the extra A
> > trimLRPatterns(Lpattern = Lpattern, subject = subject,
>  ""
> trimLRPatterns(Lpattern = Lpattern, subject = subject,
>  ""
> i think when the algorithm take "TTAACGT", never try "TTTAACGT",
> because no indel allowed.
More information about the Bioc-sig-sequencing