[BioC] Distance btw sequences from different sets

Tue Mar 19 04:33:47 CET 2013

Thanks, Kasper, installing it now. b

2013/3/19 Kasper Daniel Hansen <kasperdanielhansen at gmail.com>:
> There is also a recent stringdist package on CRAN.  I have not used it though.
>
> Kasper
>
> On Mon, Mar 18, 2013 at 11:09 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>> Hi Benilton,
>>
>>
>> On 03/18/2013 05:20 PM, Benilton Carvalho wrote:
>>>
>>> Hi,
>>>
>>> I was wondering what would be the most efficient strategy to get distances
>>> between sequences that belong to two different sets. So far, what I am
>>> doing is:
>>>
>>> library(Biostrings)
>>> set.seed(1)
>>> alph = DNA_ALPHABET[1:4]
>>> set1 = matrix(sample(alph, 30, rep=TRUE), nr=6)
>>> set1 = apply(set1, 1, paste, collapse='')
>>> set2 = matrix(sample(alph, 20, rep=TRUE), nr=4)
>>> set2 = apply(set2, 1, paste, collapse='')
>>> outer(set1, set2, function(x, y) apply(cbind(x, y), 1, stringDist))
>>
>>
>> Note that by default, stringDist() computes the Levenshtein distance
>> aka "the edit distance" between strings. This is controlled via the
>> 'method' argument. With the toy data you generate above, where all
>> the strings have the same length, it often makes sense to use the
>> Hamming distance instead i.e. to count the nb of mismatches between
>> strings. This is of course up to the user.
>>
>> However it's important to keep in mind that computing the Hamming
>> distance is much faster than computing the Levenshtein distance.
>> No surprise: the algo for computing the former is trivial, but the
>> algo for computing the latter uses a complicated and expensive
>> dynamic programming approach.
>>
>> That being said, and back to your question, if you don't mind
>> calculating a little bit more distances than necessary (and thus
>> also using more memory than necessary), you can call stringDist()
>> on the big set formed by putting the 2 different sets together,
>> and then extract only the part of the big matrix that corresponds
>> to the outer product of the 2 original sets:
>>
>>   outerStringDists1 <- function(x, y, method="levenshtein")
>>   {
>>     d <- stringDist(c(x, y), method=method)
>>     m <- as.matrix(d)[length(x) + seq_along(y), seq_along(x)]
>>     if (storage.mode(m) != storage.mode(d))
>>         storage.mode(m) <- storage.mode(d)
>>     dimnames(m) <- NULL
>>     t(m)
>>   }
>>
>> With this solution, there are more loops (looping happens inside
>> stringDist()) but the looping is much faster (it's done in C).
>> In the best case, i.e. when the 2 sets have the same size, this
>> will be hundred times faster than using outer() + apply(). Also,
>> in that case, the big intermediate matrix will be "only" 4 times
>> bigger than the final matrix, which is probably a price that is
>> OK to pay for that kind of speed-up. Nonetheless the situation
>> will quickly degrade if one set is much bigger than the other,
>> e.g. an order of magnitude bigger or more. For example, it would
>> not be efficient at all to use outerStringDists1() if 1 set had
>> 100k strings and the other 100.
>>
>> If you are in this situation, and if you want to compute the Hamming
>> distance, then here is a solution that uses sapply() over the smallest
>> of the 2 sets:
>>
>>   outerStringDists2 <- function(x, y)
>>   {
>>     if (!is(x, "XStringSet"))
>>         x <- as(x, "XStringSet")
>>     if (!is(y, "XStringSet"))
>>         y <- as(y, "XStringSet")
>>     if (length(x) < length(y))
>>         return(t(outerStringDists2(y, x)))
>>     if (length(unique(c(width(x), width(y)))) != 1L)
>>         stop("strings in 'x' and 'y' must have the same length")
>>     sapply(seq_along(y), function(j) neditAt(y[[j]], x))
>>   }
>>
>> Still faster than the outer() + apply() solution but can only compute
>> the Hamming distance.
>>
>> Sounds like maybe we should have an outerStringDists() function in
>> Biostrings?
>>
>> Cheers,
>> H.
>>
>>
>>
>>>
>>> Thanks a lot for any suggestion,
>>> benilton
>>>
>>>         [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor