[Bioc-devel] vmatchPattern Returns Out of Bounds Indices

Hervé Pagès hpages at fredhutch.org
Thu Nov 17 10:48:58 CET 2016


Hi,

These questions really belong to the support site.

On 11/16/2016 04:00 PM, Dario Strbenac wrote:
> Hello,
>
> If using vmatchPattern to find a sequence in another sequence, the resulting end index can be beyond the length of the subject XStringSet. For example:
>
> forwardPrimer <- "TCTTGTGGAAAGGACGAAACACCG"
>> range(width(reads))
> [1] 75 75
> primerEnds <- vmatchPattern(forwardPrimer, reads, max.mismatch = 1)
>> range(unlist(endIndex(primerEnds))
> [1] 23 76
>
> This causes problems if using extractAt to obtain the sequences within each read.

That is the expected behavior for matchPattern() and family when
max.mismatch != 0. This was a conscious choice. If you don't
allow indels, the matches will always have the length of the pattern.
This allows straightforward one-to-one mapping between the letters
in the pattern and their positions on the subject, which is a nice
property. For example, if the pattern is ACGT but the reported match
is range 1-3, you don't have an easy way to know how the 3 nucleotides
in the pattern are actually mapped to the subject.

Truncating the match when it goes out of bounds actually means
clipping the pattern. You'll get that behavior by allowing indels:

   > matchPattern("AAACCTT", "AAAAAAAAAAACCT")
     Views on a 14-letter BString subject
   subject: AAAAAAAAAAACCT
   views: NONE

   > matchPattern("AAACCTT", "AAAAAAAAAAACCT", max.mismatch=1)
     Views on a 14-letter BString subject
   subject: AAAAAAAAAAACCT
   views:
       start end width
   [1]     9  15     7 [AAACCT ]

   > matchPattern("AAACCTT", "AAAAAAAAAAACCT", max.mismatch=1, 
with.indels=TRUE)
     Views on a 14-letter BString subject
   subject: AAAAAAAAAAACCT
   views:
       start end width
   [1]     9  14     6 [AAACCT]

But then of course you loose the one-to-one mapping between the
letters in the pattern and their positions on the subject.

The 2 other alternatives are (1) to get rid of the out of bounds
matches or (2) to trim them. For example, to trim them:

   subjects <- DNAStringSet(c("AACCTTTTTT", "AAAAAAAAAAACCT"))
   m <- vmatchPattern("AAACCTT", subjects, max.mismatch=1)
   m
   # MIndex object of length 2
   # [[1]]
   # IRanges object with 1 range and 0 metadata columns:
   #           start       end     width
   #       <integer> <integer> <integer>
   #  [1]         0         6         7
   #
   # [[2]]
   # IRanges object with 1 range and 0 metadata columns:
   #           start       end     width
   #       <integer> <integer> <integer>
   #   [1]         9        15         7

   m2 <- IRangesList(start=pmax(start(m), 1),
                     end=pmin(end(m), width(subjects)))
   m2
   # IRangesList of length 2
   # [[1]]
   # IRanges object with 1 range and 0 metadata columns:
   #           start       end     width
   #       <integer> <integer> <integer>
   #   [1]         1         6         6
   #
   # [[2]]
   # IRanges object with 1 range and 0 metadata columns:
   #           start       end     width
   #       <integer> <integer> <integer>
   #   [1]         9        14         6

Then:

   extractAt(subjects, m2)
   # DNAStringSetList of length 2
   # [[1]] AACCTT
   # [[2]] AAACCT

H.

> For example:
>
> sequences = extractAt(reads, locations)
> Error in .normarg_at2(at, x) :
>   some ranges in 'at' are off-limits with respect to their corresponding sequence
>   in 'x'
>
> It's rare, but still a problem, nonetheless.
>
>> table(unlist(endIndex(primerLocations)) >  75)
>
>  FALSE   TRUE
> 366225      2
>
> This happens with Biostrings 2.42.0.
>
> --------------------------------------
> Dario Strbenac
> University of Sydney
> Camperdown NSW 2050
> Australia
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list