[BioC] a question about the low level match function

Harris A. Jaffee hj at jhu.edu
Tue Nov 6 22:02:06 CET 2012


On Nov 6, 2012, at 3:02 PM, wang peter wrote:
> Dear all, Harry and Steve:
>     I am sorry to disturb you again. This time I read the manual and
> some of the source code carefully, but I am still confused about how
> trimLRPatterns works.
>     I traced it back to the function
> 
> Biostrings:::.computeTrimEnd

The relevant statement is

    ii <- which.isMatchingEndingAt(pattern = Rpattern, subject = subject, 
        ending.at = subject_width, max.mismatch = max.Rmismatch, 
        with.indels = with.Rindels, fixed = Rfixed, auto.reduce.pattern = TRUE)

'subject_width' is a single constant at this point, because subjects of
varying width were already diverted by this earlier test:

    if (!isConstant(width(subject))) {
        tmp <- .computeTrimStart(reverse(Rpattern), reverse(subject), 
            max.Rmismatch, with.Rindels, Rfixed)
        return(width(subject) - tmp + 1L)
    }
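
The 'width(subject) - tmp + 1L' step is the usual mirror mapping between a
position in a reversed string and the same letter's position in the original.
Here is a minimal base-R sketch of that mapping only.  'rev_string' and
'mirror' are hypothetical helpers of mine, not Biostrings functions, and the
sketch says nothing about what .computeTrimStart itself does internally.

```r
# Hypothetical helpers for illustration; not part of Biostrings.
rev_string <- function(s) paste(rev(strsplit(s, "")[[1]]), collapse = "")
mirror <- function(pos, width) width - pos + 1L

s <- "ACGTTCGGAA"
r <- rev_string(s)
w <- nchar(s)

# Position p in the reversed string holds the same letter as
# position mirror(p, w) in the original string.
same_letter <- vapply(seq_len(w), function(p) {
  substr(r, p, p) == substr(s, mirror(p, w), mirror(p, w))
}, logical(1))
all(same_letter)
```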

auto.reduce.pattern=TRUE tells the *EndingAt function to test a series of
patterns against each subject element, each within the limits given by the
'max.mismatch' vector of edit distances.  These patterns are constructed
behind the scenes (in C) from your single 'pattern=Rpattern' by dropping one
trailing letter at a time.  For example, if your Rpattern were "TCGGAA", the
test patterns would be, in order,

"TCGGAA"
 "TCGGA"
  "TCGG"
   "TCG"
    "TC"
     "T"

They are tested using 'ending.at=subject_width', as I've hinted by the way
I've written them.  The "which" in the function name is associated with its
underlying code (in this case, C code) stopping at the first hit, subject to
your edit limits.  For example, if a subject element happens to end with
"TCGGA" within your limits, the test loop for that subject element stops.
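
If it helps, the loop just described can be imitated in plain R.  This is only
a rough sketch under simplifying assumptions: it counts plain letter
mismatches (no indels, no IUPAC ambiguity codes), it assumes the limit for a
test pattern of length n is stored at position n of the limits vector, and
'n_mismatch' and 'first_hit_length' are hypothetical names of mine.  It is not
the actual C code.

```r
# Count mismatching letters between two equal-length strings (no indels).
n_mismatch <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Hypothetical helper: try the full pattern first, then drop one trailing
# letter at a time, always aligning the test pattern with the end of the
# subject.  Returns the length of the first test pattern within its limit,
# or NA if none qualifies.
first_hit_length <- function(pattern, subject, max_mismatch) {
  ending_at <- nchar(subject)
  for (len in rev(seq_len(nchar(pattern)))) {
    if (len > ending_at) next            # test pattern longer than subject
    p <- substr(pattern, 1L, len)
    s <- substr(subject, ending_at - len + 1L, ending_at)
    if (n_mismatch(p, s) <= max_mismatch[len]) return(len)  # stop at first hit
  }
  NA_integer_
}

first_hit_length("TCGGAA", "ACGTACGTTCGGA", rep(0L, 6))
```

With no mismatches allowed, "TCGGAA" fails against the last six letters of
that subject, "TCGGA" matches the last five, and the loop stops there without
ever trying "TCGG".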

> showMethods(which.isMatchingEndingAt, includeDefs=TRUE)
> Biostrings:::.matchPatternAt
> 
>    if (is(subject, "XString"))
>        .Call2("XString_match_pattern_at", pattern, subject,
>            at, at.type, max.mismatch, min.mismatch, with.indels,
>            fixed, ans.type, auto.reduce.pattern, PACKAGE = "Biostrings")
>    else .Call2("XStringSet_vmatch_pattern_at", pattern, subject,
>        at, at.type, max.mismatch, min.mismatch, with.indels,
>        fixed, ans.type, auto.reduce.pattern, PACKAGE = "Biostrings")
> 
> I think it calls the low-level code.

Yes, these are calls to C.  'at.type' is set to 1L by all the *EndingAt
functions (and to 0L by all the *StartingAt functions).  The statement
above in .computeTrimEnd supplies 'ending.at', namely the subject width,
which is sent as the 'at' argument of .matchPatternAt and forwarded to C.
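
To picture what 'at' means under the two 'at.type' values, here is a small
base-R sketch of which stretch of the subject a pattern of a given length is
compared with.  'window_for' is a hypothetical helper of mine, and the sketch
ignores indels, which can change the width of the aligned stretch.

```r
# Hypothetical helper: the substring of 'subject' that a pattern of length
# 'pattern_len' is lined up with.
#   at_type == 0L: 'at' is where the pattern starts (the *StartingAt case)
#   at_type == 1L: 'at' is where the pattern ends   (the *EndingAt case)
window_for <- function(subject, pattern_len, at, at_type) {
  if (at_type == 0L) {
    substr(subject, at, at + pattern_len - 1L)
  } else {
    substr(subject, at - pattern_len + 1L, at)
  }
}

subject <- "ACGTTCGGAA"
window_for(subject, 4L, 1L, 0L)                # first four letters
window_for(subject, 4L, nchar(subject), 1L)    # last four letters
```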

> for example:
> trimLRPatterns(Rpattern = Rpattern, subject = subject,
> max.Rmismatch=0.1, with.Lindels=TRUE)
> 
> subject = "TATAGTAGATATTGGAATAGTACTGTAGGCACCATCAATAGATCGGAA"
> Rpattern =              "GAATAGTACTGTAGGCACCATCAATAGATCGGAA"
> 
> then the function will change max.Rmismatch to
> max.Rmismatch= as.integer(max.Rmismatch*1:nchar(Rpattern))
> [1] 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3
> 
> As far as I know, the process is to compute the distance between p and s:
> 
> p = "GAATAGTACTGTAGGCACCATCAATAGATCGGAA" allowing 3 mismatch
> s = "GAATAGTACTGTAGGCACCATCAATAGATCGGAA"
> 
> p = "AATAGTACTGTAGGCACCATCAATAGATCGGAA"  allowing 3 mismatch
> s = "GAATAGTACTGTAGGCACCATCAATAGATCGGA"
> ...
> p = "A"  allowing 0 mismatch
> s = "G"
> 
> But what does the parameter 'at' mean?

See 'at' and 'ending.at' above.  Does this help?
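
To make 'ending.at' concrete with your own strings, here is a short base-R
check.  It only reproduces your limits vector and looks at the stretch of the
subject that ends at the subject width; it does not call Biostrings, and the
variable names 'limits', 'ending_at' and 'start' are mine.

```r
subject  <- "TATAGTAGATATTGGAATAGTACTGTAGGCACCATCAATAGATCGGAA"
Rpattern <- "GAATAGTACTGTAGGCACCATCAATAGATCGGAA"

# The per-length mismatch limits from your message.
limits <- as.integer(0.1 * seq_len(nchar(Rpattern)))
limits

# 'ending.at' is the subject width, so the full pattern is lined up with
# the last nchar(Rpattern) letters of the subject.
ending_at <- nchar(subject)
start <- ending_at - nchar(Rpattern) + 1L
start
substr(subject, start, ending_at) == Rpattern
```

Here the full Rpattern already coincides with the end of the subject, so the
very first test pattern is a hit and no shorter one needs to be tried.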

> -- 
> shan gao
> Room 231(Dr.Fei lab)
> Boyce Thompson Institute
> Cornell University
> Tower Road, Ithaca, NY 14853-1801
> Office phone: 1-607-254-1267(day)
> Official email:sg839 at cornell.edu
> Facebook:http://www.facebook.com/profile.php?id=100001986532253


