[Bioc-devel] Biostrings: ambiguous matching for 'compareStrings'

Hervé Pagès hpages at fhcrc.org
Sat Mar 16 00:22:57 CET 2013

Hi Julian,

On 03/15/2013 08:27 AM, Julian Gehring wrote:
> Hi,
> When I want to compare two aligned sequences of the same length and find
> the mismatching positions, 'compareStrings' from the 'Biostrings'
> package seems to be the best choice.  However, it does take into account
> any ambiguous matches (as matching any base to 'N'), as it can be found
> with the 'fixed' for 'matchPattern' and related functions.  Would it be
> reasonable to include this, to ensure a consistent behavior of methods
> for 'XString' and related objects?

Adding the 'fixed' arg to compareStrings() sounds like a good
suggestion to me. Note that it would only make sense to use
compareStrings() with 'fixed=FALSE' if the 2 sequences were
actually aligned with 'fixed=FALSE'. So from now I'm assuming
that this is how your sequences were aligned i.e. with a tool
that supports 'fixed=FALSE'.

However, since no alignment tools in Biostrings that supports
'fixed=FALSE' also supports indels (but see [1]), I would also
assume that your aligned sequences don't contain indels. Unless
your sequences were aligned with an external tool?

So if that's the case (i.e. no dashes in your aligned sequences),
and if you're only interested in the number of mismatches, you can
use neditAt():

   > s1 <- DNAString("AANTTTWCCW")
   > s2 <- DNAString("ACATTGSCCV")
   > neditAt(s1, s2)
   [1] 5
   > neditAt(s1, s2, fixed=FALSE)
   [1] 3

If you really want the position of the mismatches, here is an ugly
and inefficient trick:

   > mapply(function(x, y) neditAt(x, y, fixed=FALSE),
            DNAStringSet(s1, seq_along(s1), width=1),
            DNAStringSet(s2, seq_along(s2), width=1))
  [1] 0 1 0 0 0 1 1 0 0 0

then coerce to logical with as.logical() and apply which() on the
result to get the positions of the mismatches.

The fact that there is not easy/efficient workaround (and that
those workarounds only work correctly if the aligned sequences
don't contain indels), would make the addition of the 'fixed'
arg to compareStrings() even more valuable. Just added to my
TODO list.


[1] At least not at the moment, but I still have to check-in a
     patch from H. Jaffee that would allow matchPattern() and
     family to support with.indels=TRUE and fixed=FALSE at the
     same time. Hopefully I manage to get to this before the
     upcoming release. BTW thanks Harris for your infinite patience
     on this matter :-)

> Best wishes
> Julian
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel

Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

More information about the Bioc-devel mailing list