[BioC] Distance btw sequences from different sets

Tue Mar 19 04:33:13 CET 2013

Hi Herve,

Hamming Distance is actually my interest (didn't use it on my example
as I was trying to save some typing).

Thank you so much for your help, will definitely put your solution to
good use. For me, it'd be useful to have such functionality in
Biostrings....

With best wishes,
benilton

2013/3/19 Hervé Pagès <hpages at fhcrc.org>:
> Hi Benilton,
>
>
> On 03/18/2013 05:20 PM, Benilton Carvalho wrote:
>>
>> Hi,
>>
>> I was wondering what would be the most efficient strategy to get distances
>> between sequences that belong to two different sets. So far, what I am
>> doing is:
>>
>> library(Biostrings)
>> set.seed(1)
>> alph = DNA_ALPHABET[1:4]
>> set1 = matrix(sample(alph, 30, rep=TRUE), nr=6)
>> set1 = apply(set1, 1, paste, collapse='')
>> set2 = matrix(sample(alph, 20, rep=TRUE), nr=4)
>> set2 = apply(set2, 1, paste, collapse='')
>> outer(set1, set2, function(x, y) apply(cbind(x, y), 1, stringDist))
>
>
> Note that by default, stringDist() computes the Levenshtein distance
> aka "the edit distance" between strings. This is controlled via the
> 'method' argument. With the toy data you generate above, where all
> the strings have the same length, it often makes sense to use the
> Hamming distance instead i.e. to count the nb of mismatches between
> strings. This is of course up to the user.
>
> However it's important to keep in mind that computing the Hamming
> distance is much faster than computing the Levenshtein distance.
> No surprise: the algo for computing the former is trivial, but the
> algo for computing the latter uses a complicated and expensive
> dynamic programming approach.
>
> That being said, and back to your question, if you don't mind
> calculating a little bit more distances than necessary (and thus
> also using more memory than necessary), you can call stringDist()
> on the big set formed by putting the 2 different sets together,
> and then extract only the part of the big matrix that corresponds
> to the outer product of the 2 original sets:
>
>   outerStringDists1 <- function(x, y, method="levenshtein")
>   {
>     d <- stringDist(c(x, y), method=method)
>     m <- as.matrix(d)[length(x) + seq_along(y), seq_along(x)]
>     if (storage.mode(m) != storage.mode(d))
>         storage.mode(m) <- storage.mode(d)
>     dimnames(m) <- NULL
>     t(m)
>   }
>
> With this solution, there are more loops (looping happens inside
> stringDist()) but the looping is much faster (it's done in C).
> In the best case, i.e. when the 2 sets have the same size, this
> will be hundred times faster than using outer() + apply(). Also,
> in that case, the big intermediate matrix will be "only" 4 times
> bigger than the final matrix, which is probably a price that is
> OK to pay for that kind of speed-up. Nonetheless the situation
> will quickly degrade if one set is much bigger than the other,
> e.g. an order of magnitude bigger or more. For example, it would
> not be efficient at all to use outerStringDists1() if 1 set had
> 100k strings and the other 100.
>
> If you are in this situation, and if you want to compute the Hamming
> distance, then here is a solution that uses sapply() over the smallest
> of the 2 sets:
>
>   outerStringDists2 <- function(x, y)
>   {
>     if (!is(x, "XStringSet"))
>         x <- as(x, "XStringSet")
>     if (!is(y, "XStringSet"))
>         y <- as(y, "XStringSet")
>     if (length(x) < length(y))
>         return(t(outerStringDists2(y, x)))
>     if (length(unique(c(width(x), width(y)))) != 1L)
>         stop("strings in 'x' and 'y' must have the same length")
>     sapply(seq_along(y), function(j) neditAt(y[[j]], x))
>   }
>
> Still faster than the outer() + apply() solution but can only compute
> the Hamming distance.
>
> Sounds like maybe we should have an outerStringDists() function in
> Biostrings?
>
> Cheers,
> H.
>
>
>>
>> Thanks a lot for any suggestion,
>> benilton
>>
>>         [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319