[BioC] a problem of trimLRPatterns still confused me
Harris A. Jaffee
hj at jhu.edu
Sat Dec 1 15:34:51 CET 2012
See below.
On Nov 30, 2012, at 3:36 PM, Wang Peter wrote:
> thank you very much, Harris,you helped me again
>
> now i understand, see the below
>
> max.mismatchs <- 0.2*1:nchar(Rpattern)
> subject = "GGTAACTTTTCTGACACCTCCTGCTTAAAACCCCAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTCACGGTCTGTATTCGTACTGAAAATCAAGATCAAG"
>
> Rpattern = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG"
>
> sapply((nchar(subject)-nchar(Rpattern)+1):nchar(subject), function(j) {
> s = substr(subject, j, nchar(subject))
> p = substr(Rpattern, 1, nchar(subject)-j+1)
> neditEndingAt(ending.at=nchar(s), pattern = p, subject = s,
> with.indels=TRUE)
> })
>
> all distance
> [1] 32 33 33 32 31 32 31 30 29 28 27 26 27 26 25 25 24 23 22 22 21 20 20 20
> [25] 20 19 18 17 18 17 17 18 17 16 15 16 15 14 13 12 12 11 10 9 8 7 6 6
> [49] 6 6 6 5 4 3 (2) 3 3 3 3 3 2 1 0 1
>
> max.mismatchs
> [1] 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 (2.0) 2.2 2.4 2.6
> 2.8 3.0 3.2 3.4 3.6 3.8
> [20] 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4
> 6.6 6.8 7.0 7.2 7.4 7.6
> [39] 7.8 8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8 10.0 10.2
> 10.4 10.6 10.8 11.0 11.2 11.4
> [58] 11.6 11.8 12.0 12.2 12.4 12.6 12.8
>
> when the function find a distance < = the corresponding mismatch. see
> (2) and (2.0), the function stops.
Yes.
> but i think the distance between those 10bp kmer should be 4, not 2
>
> CAAGATC AAG
> AGATCGGAAG
You are correct about the edit-distance between the 2 strings of length
10, but that is not relevant here. trimLRPatterns is based on
?`lowlevel-matching`:
If 'with.indels' is 'TRUE', then the "edit distance" is used: for
each position specified in 'at', P is compared to all the
substrings S' of S starting at this position and the smallest
distance is returned.
This needs to be read as applying, inverted, to 'ending at' situations.
In your case, _this position_ is the *end* of the the 10-letter subject S.
The 10-letter pattern P is compared to all the substrings S' of S *ending*
at the specified position. The smallest distance, 2, is the one between P
and the 8-letter suffix S' of S. Now it is possible that longer suffixes
S' of S would have the same distance 2 from P, and the *longest* of those
would be desirable. I believe the low-level matching code actually finds
this longest match (without any mechanism, at present, to report it).
More information about the Bioconductor
mailing list