[BioC] Count differences between sequences

Sun Mar 28 23:23:19 CEST 2010

On 03/28/2010 01:17 PM, Martin Morgan wrote:
> Hi Erik et al.,
> 
> Not offering anything practical, but...
> 
> On 03/28/2010 11:39 AM, erikwright at comcast.net wrote:
>> Hi Patrick, Michael, HervÃ©, and Martin,

> f0 <- function(dna)
> {
>     l <- unique(width(dna))
>     w <- length(dna)
> 
>     dna0 <- factor(unlist(strsplit(as.character(dna), "")))
>     idx <- seq(1, l * w, l)  # 'column' offsets
>     i <- seq_len(w)
> 
>     d <- matrix(l, w, w)
>     for (j in seq_len(l)-1L) {
>         grp <- split(i, dna0[j+idx]) # a 'column' of nucleotides
> ##         for (k in grp)
> ##             d[k, k] = d[k, k] - 1L
>     }
>     d
> }
> 
> Internally, 'split' does its work in 2N comparisons. This important part
> is that comparisons are accomplished in linear rather than polynomial time.

> same nucleotide (i.e., in k). This is unfortunately the slow part of the
> revised code, and it is polynomial time -- in the best case, for 4
> equally frequent nucleotides, we do 4 * ( (N/4)^2 ) = N^2 / 4 updates,
> and in the worst case (all nucleotides identical) we do N^2 updates. We
> pay a heavy price for doing this at the R level, and basically get
> swamped by the update rather than sort cost.

> But we shouldn't lose track of the fact that we can do the sorting much
> more efficiently. I've also looked a little at sort.list(,
> method="radix"), which orders a factor very quickly.

dampening my enthusiasm a bit here, as even implemented in C the update
operation still dominates, and the algorithm remains polynomial.

Martin