[BioC] Distance btw sequences from different sets

Tue Mar 19 04:09:06 CET 2013

Hi Benilton,

On 03/18/2013 05:20 PM, Benilton Carvalho wrote:
> Hi,
>
> I was wondering what would be the most efficient strategy to get distances
> between sequences that belong to two different sets. So far, what I am
> doing is:
>
> library(Biostrings)
> set.seed(1)
> alph = DNA_ALPHABET[1:4]
> set1 = matrix(sample(alph, 30, rep=TRUE), nr=6)
> set1 = apply(set1, 1, paste, collapse='')
> set2 = matrix(sample(alph, 20, rep=TRUE), nr=4)
> set2 = apply(set2, 1, paste, collapse='')
> outer(set1, set2, function(x, y) apply(cbind(x, y), 1, stringDist))

Note that by default, stringDist() computes the Levenshtein distance
aka "the edit distance" between strings. This is controlled via the
'method' argument. With the toy data you generate above, where all
the strings have the same length, it often makes sense to use the
Hamming distance instead i.e. to count the nb of mismatches between
strings. This is of course up to the user.

However it's important to keep in mind that computing the Hamming
distance is much faster than computing the Levenshtein distance.
No surprise: the algo for computing the former is trivial, but the
algo for computing the latter uses a complicated and expensive
dynamic programming approach.

That being said, and back to your question, if you don't mind
calculating a little bit more distances than necessary (and thus
also using more memory than necessary), you can call stringDist()
on the big set formed by putting the 2 different sets together,
and then extract only the part of the big matrix that corresponds
to the outer product of the 2 original sets:

   outerStringDists1 <- function(x, y, method="levenshtein")
   {
     d <- stringDist(c(x, y), method=method)
     m <- as.matrix(d)[length(x) + seq_along(y), seq_along(x)]
     if (storage.mode(m) != storage.mode(d))
         storage.mode(m) <- storage.mode(d)
     dimnames(m) <- NULL
     t(m)
   }

With this solution, there are more loops (looping happens inside
stringDist()) but the looping is much faster (it's done in C).
In the best case, i.e. when the 2 sets have the same size, this
will be hundred times faster than using outer() + apply(). Also,
in that case, the big intermediate matrix will be "only" 4 times
bigger than the final matrix, which is probably a price that is
OK to pay for that kind of speed-up. Nonetheless the situation
will quickly degrade if one set is much bigger than the other,
e.g. an order of magnitude bigger or more. For example, it would
not be efficient at all to use outerStringDists1() if 1 set had
100k strings and the other 100.

If you are in this situation, and if you want to compute the Hamming
distance, then here is a solution that uses sapply() over the smallest
of the 2 sets:

   outerStringDists2 <- function(x, y)
   {
     if (!is(x, "XStringSet"))
         x <- as(x, "XStringSet")
     if (!is(y, "XStringSet"))
         y <- as(y, "XStringSet")
     if (length(x) < length(y))
         return(t(outerStringDists2(y, x)))
     if (length(unique(c(width(x), width(y)))) != 1L)
         stop("strings in 'x' and 'y' must have the same length")
     sapply(seq_along(y), function(j) neditAt(y[[j]], x))
   }

Still faster than the outer() + apply() solution but can only compute
the Hamming distance.

Sounds like maybe we should have an outerStringDists() function in
Biostrings?

Cheers,
H.

>
> Thanks a lot for any suggestion,
> benilton
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319