[Bioc-sig-seq] a question about trimLRPatterns
Harris A. Jaffee
hj at jhu.edu
Wed Aug 31 22:44:59 CEST 2011
On Aug 31, 2011, at 3:20 PM, wang peter wrote:
dear Harris A. Jaffee,
> hj at jhu.edu
>
> thank u very much for your explaination.
> but i am still confused because of not knowing its algorithm
> could u tell me if i can read them from lowlevel_matching.c
>
> let me try to concluded what u have told me:
>
> 1. the algorithm will change every max.Rmismatch to a vector.
> by adding -1 if it is not 0.
> e.g.
> if max.Lmismatch=0 it will do : max.Lmismatch = rep(0, nchar
> (Lpattern))
> if max.Lmismatch=c(2,2) it will get 2 2 -1 -1 -1
> how does the algorithm use -1 or 0 to control the alignment??
>
See ?trimLRPatterns. Yes, 0 is a special value, as is any scalar in
[0,1).
These are interpreted as "rates", relative to the length of the
substring
of the pattern to be tested. (See ?agrep) For Lpattern, such a
scalar s
is converted internally to
as.integer(s * 1:nchar(Lpattern))
The order of this vector of max.mismatch limits is counter-intuitive
(and
wasn't my idea). A sanity check is that it should be monotone
increasing,
not necessarily strictly, as with the above construction using the
rate s.
The suffixes of the Lpattern are tested in the order longest first, with
the first match winning. [They originally were tested shortest
first, and
this explains the ordering of the vector, at least from the
implementation
perspective. But in this scenario, you cannot know when you are
done, so
you inevitably waste time.] As I said, the mismatch limit for the
test of
the pattern suffix of length i is max.Lmismatch[i], hence the
elements of
the vector max.Lmismatch are used from the top down.
As I said, max.Lmismatch values shorter than nchar(Lpattern) are padded
with -1's at the low end. E.g., contrary to what you wrote above, c
(2,2)
will be expanded into -1, -1, .... 2, 2.
See ?isMatchingAt -- especially the Details section. 0 as a
max.mismatch
limit means that only perfect matches are acceptable (which rules out
matches involving edits of indels); -1 as a max.mismatch limit
*prevents*
any match, since any edit distance is non-negative.
> 2. the algorithm will try any suffix segment on Lmismatch by
> substr(Lpattern, i, nchar(Lpattern)), i = 1:nchar(Lpattern)
> e.g. Lpattern = "TTTAACGT"
> it will try "T","GT",......
>
Yes, but in the opposite order: "TTTAACGT", "TTAACGT", .... "GT", "T".
>
> 3. during each try, if the length of segment is equal to the
> lenght of
> subject, the algoritm will stop, except it can allow indel.
>
I'm not sure what you mean with the stopping.
> 4. and indel should be counted as a mismatch
>
If indels are permitted and used, each letter added or removed is
counted
as 1 toward the edit distance. See the examples on ?isMatchingAt .
> subject = "TTTACGT"
> Lpattern = "TTTAACGT" # pattern has an extra 'A'
>
> # need to allow for 1 "edit", to remove the extra A
> > trimLRPatterns(Lpattern = Lpattern, subject = subject,
> max.Lmismatch=1,
> with.Lindels=TRUE)
> [1] ""
>
> trimLRPatterns(Lpattern = Lpattern, subject = subject,
> max.Lmismatch=4)
> [1] ""
> i think when the algorithm take "TTAACGT", never try "TTTAACGT",
> because no indel allowed.
>
>
More information about the Bioc-sig-sequencing
mailing list